Spark Cosmos DB connector is dropping columns where the majority of rows are null

Question

I am trying to read 30K rows from Cosmos DB using the Spark Cosmos connector with the following code:

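import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark.config.Config

val readConfig = Config(Map(
  "Endpoint" -> "",
  "Masterkey" -> "",
  "Database" -> "",
  "Collection" -> "",
  "PreferredRegions" -> "",
  "query_custom" -> """SELECT t.id,t.gender,t.loc from Tab t"""
))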

val df = spark.read.cosmosDB(readConfig)

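Of the 30K rows, only 2 have a non-null value in the "loc" column. For some reason, the connector drops the "loc" column entirely from the resulting DataFrame, which ends up with the following schema: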

df.printSchema
root
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)

Can someone help me include the "loc" column in my final DataFrame?

Answer 1

When you do not specify a schema on read, the Spark connector has to infer one. To do so, it samples documents and builds the schema from them.

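The likely problem is that none of the sampled documents has this property (you said only 2 out of 30K do), so the inferred schema does not contain it, and the column is therefore missing when the full data is read with that schema.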

Providing the schema directly on the read call solves this. Alternatively, you can customize the sample size by increasing schema_samplesize (see https://github.com/Azure/azure-cosmosdb-spark/blob/19561f0d42eaa91f9e4793fbdf30b62b22829868/src/main/scala/com/microsoft/azure/cosmosdb/spark/config/CosmosDBConfig.scala#L47; the default is 1000).
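Both options can be sketched as follows. This is a minimal, untested sketch rather than a definitive recipe: it assumes "loc" is a string (the question does not say), it uses the generic format/options/load reader form instead of the cosmosDB(readConfig) helper from the question, and the sample size of 30000 is only an illustrative value chosen to match the row count mentioned above. The connection values are left blank, as in the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// In a notebook this returns the already running session.
val spark = SparkSession.builder().getOrCreate()

// Same connection settings as in the question, passed as plain reader options.
val baseOptions = Map(
  "Endpoint" -> "",
  "Masterkey" -> "",
  "Database" -> "",
  "Collection" -> "",
  "PreferredRegions" -> "",
  "query_custom" -> "SELECT t.id, t.gender, t.loc FROM Tab t"
)

// Option 1: supply the schema explicitly so no inference is needed.
// Assumption: "loc" is a string; adjust the type to the real data.
val explicitSchema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("gender", StringType, nullable = true),
  StructField("loc", StringType, nullable = true)
))

val dfWithSchema = spark.read
  .schema(explicitSchema)
  .format("com.microsoft.azure.cosmosdb.spark")
  .options(baseOptions)
  .load()

// Option 2: keep schema inference but sample more documents.
// 30000 is an illustrative value, not a recommendation.
val dfLargerSample = spark.read
  .format("com.microsoft.azure.cosmosdb.spark")
  .options(baseOptions + ("schema_samplesize" -> "30000"))
  .load()

dfWithSchema.printSchema()
```

With the explicit schema, "loc" should appear in the DataFrame regardless of which documents would have been sampled. A larger sample only makes it more likely that the two documents carrying "loc" are seen during inference, and it makes the inference step more expensive.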


azure

apache-spark

azure-cosmosdb

azure-databricks