DataFrame.write.parquet - Parquet file cannot be read by Hive or Impala

Question

I wrote a DataFrame to HDFS with PySpark using the following command:

from pyspark.sql.functions import col

df.repartition(col("year"))\
    .write.option("maxRecordsPerFile", 1000000)\
    .parquet('/path/tablename', mode='overwrite', partitionBy=["year"], compression='snappy')

When I look at HDFS, I can see that the files are written there correctly. However, when I try to read the table with Hive or Impala, the table cannot be found.

What is going wrong here? Am I missing something?

Interestingly, df.write.format('parquet').saveAsTable("tablename") works properly.

Answer 1

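This is expected behaviour in Spark:

df...etc.parquet("") writes the data to the HDFS location and does not create any table in Hive.
df..saveAsTable(""), on the other hand, creates the table in Hive and writes the data to it.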

In the case the table already exists, behavior of this function depends on the save mode, specified by the mode function (default to throwing an exception). When mode is Overwrite, the schema of the DataFrame does not need to be the same as that of the existing table.

That is why you are not able to find the table in Hive after running df...parquet("").
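
If the files written by .parquet() should stay where they are, one possible approach is to register an external table on top of that location. The following is only a minimal sketch: the helper external_table_ddl and the columns id and value are hypothetical (the question does not show the real schema), the INT type for year is an assumption, and the commented-out calls assume a SparkSession named spark that was created with Hive support.

```python
def external_table_ddl(table, location, columns, partition_columns):
    """Build a Hive CREATE EXTERNAL TABLE statement for existing Parquet files.

    `columns` and `partition_columns` are lists of (name, hive_type) tuples.
    The partition column is declared in PARTITIONED BY only, not in the column list.
    """
    cols = ", ".join(f"`{name}` {dtype}" for name, dtype in columns)
    parts = ", ".join(f"`{name}` {dtype}" for name, dtype in partition_columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({cols}) "
        f"PARTITIONED BY ({parts}) "
        f"STORED AS PARQUET "
        f"LOCATION '{location}'"
    )


ddl = external_table_ddl(
    table="tablename",
    location="/path/tablename",
    columns=[("id", "BIGINT"), ("value", "STRING")],  # hypothetical columns
    partition_columns=[("year", "INT")],              # assumed type for `year`
)
print(ddl)

# Assuming a Hive-enabled SparkSession called `spark`:
# spark.sql(ddl)                             # register the table in the metastore
# spark.sql("MSCK REPAIR TABLE tablename")   # pick up the year=... partition directories
```

After the table and its partitions are registered, Impala typically needs its metadata reloaded as well (for example with INVALIDATE METADATA tablename) before it can see the table.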

python

hive

apache-spark

parquet

pyspark