how to read multiple text files into a dataframe in pyspark

I have several txt files containing JSON data in a directory (I only have the directory path, not the file names), and I need to read them all into a dataframe.

I tried this:

df = sc.wholeTextFiles("path/*")

But I can't even display the data, and my main goal is to run different kinds of queries on it.

Instead of wholeTextFiles (which returns key-value pairs, with the file path as the key and the file contents as the value),

try read.json and pass it your directory path; Spark will read all the files in the directory into a dataframe:

df = spark.read.json("<directory_path>/*")
df.show()
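Since the goal is to query the data, the DataFrame that read.json returns can then be registered as a temp view and queried with SQL. Outside Spark, the "read every JSON file under a directory" step it performs can be sketched in plain Python; the file names and records below are invented for illustration:

```python
import glob
import json
import os
import tempfile

# Create a throwaway directory with two JSON text files, standing in
# for the real directory whose file names are unknown.
tmpdir = tempfile.mkdtemp()
for name, record in [("a.txt", {"id": 1, "city": "NYC"}),
                     ("b.txt", {"id": 2, "city": "LA"})]:
    with open(os.path.join(tmpdir, name), "w") as f:
        json.dump(record, f)

# spark.read.json("<dir>/*") conceptually does this: match every file
# under the path and parse each one as JSON into rows.
rows = [json.load(open(p)) for p in sorted(glob.glob(os.path.join(tmpdir, "*")))]
print(rows)  # [{'id': 1, 'city': 'NYC'}, {'id': 2, 'city': 'LA'}]
```

In Spark the matching and parsing are distributed across the cluster, but the per-file behavior is the same: each file (or each line, for line-delimited JSON) becomes rows in the resulting dataframe.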

From the docs:

wholeTextFiles(path, minPartitions=None, use_unicode=True)

Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.

Note: Small files are preferred, as each file will be loaded fully in memory.
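That said, data read with wholeTextFiles isn't unusable: the values of the resulting pairs are the raw file contents, so they can still be parsed as JSON (in PySpark, one common pattern is to feed the values to read.json). A minimal plain-Python sketch of that key/value shape and the parsing step, with made-up paths and records:

```python
import json

# wholeTextFiles returns (file_path, file_content) pairs;
# the paths and contents below are invented to mimic that shape.
pairs = [
    ("hdfs://data/a.txt", '{"id": 1, "city": "NYC"}'),
    ("hdfs://data/b.txt", '{"id": 2, "city": "LA"}'),
]

# Keep only the values (the raw JSON text) and parse each one --
# conceptually what handing the values RDD to read.json amounts to.
records = [json.loads(content) for _, content in pairs]
print(records)  # [{'id': 1, 'city': 'NYC'}, {'id': 2, 'city': 'LA'}]
```

This is why wholeTextFiles alone shows no tabular data: it gives you strings, not parsed rows, and the note above also warns that each file is loaded fully into memory, so read.json is the simpler fit here.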