how to read multiple text files into a dataframe in pyspark

I have several txt files containing JSON data in a directory (I only have the directory path, not the file names), and I need to read them all into a dataframe.

I tried this:

df = sc.wholeTextFiles("path/*")

But I can't even display the data, and my main goal is to run different kinds of queries on it.

Instead of wholeTextFiles (which returns key-value pairs, with the file path as the key and the file contents as the value),

try read.json and pass it your directory path; Spark will read all the files in the directory into a dataframe:

df = spark.read.json("<directory_path>/*")
df.show()
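Since the goal is to query the data, the DataFrame that read.json returns can then be registered as a temp view and queried with SQL. Outside Spark, the "read every JSON file under a directory" step it performs can be sketched in plain Python; the file names and records below are invented for illustration:

```python
import glob
import json
import os
import tempfile

# Create a throwaway directory with two JSON text files, standing in
# for the real directory whose file names are unknown.
tmpdir = tempfile.mkdtemp()
for name, record in [("a.txt", {"id": 1, "city": "NYC"}),
                     ("b.txt", {"id": 2, "city": "LA"})]:
    with open(os.path.join(tmpdir, name), "w") as f:
        json.dump(record, f)

# spark.read.json("<dir>/*") conceptually does this: match every file
# under the path and parse each one as JSON into rows.
rows = [json.load(open(p)) for p in sorted(glob.glob(os.path.join(tmpdir, "*")))]
print(rows)  # [{'id': 1, 'city': 'NYC'}, {'id': 2, 'city': 'LA'}]
```

In Spark the matching and parsing are distributed across the cluster, but the per-file behavior is the same: each file (or each line, for line-delimited JSON) becomes rows in the resulting dataframe.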

From the docs:

wholeTextFiles(path, minPartitions=None, use_unicode=True)

Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.

Note: Small files are preferred, as each file will be loaded fully in memory.
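That said, data read with wholeTextFiles isn't unusable: the values of the resulting pairs are the raw file contents, so they can still be parsed as JSON (in PySpark, one common pattern is to feed the values to read.json). A minimal plain-Python sketch of that key/value shape and the parsing step, with made-up paths and records:

```python
import json

# wholeTextFiles returns (file_path, file_content) pairs;
# the paths and contents below are invented to mimic that shape.
pairs = [
    ("hdfs://data/a.txt", '{"id": 1, "city": "NYC"}'),
    ("hdfs://data/b.txt", '{"id": 2, "city": "LA"}'),
]

# Keep only the values (the raw JSON text) and parse each one --
# conceptually what handing the values RDD to read.json amounts to.
records = [json.loads(content) for _, content in pairs]
print(records)  # [{'id': 1, 'city': 'NYC'}, {'id': 2, 'city': 'LA'}]
```

This is why wholeTextFiles alone shows no tabular data: it gives you strings, not parsed rows, and the note above also warns that each file is loaded fully into memory, so read.json is the simpler fit here.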