Sorting a collection of values from a text file and saving the sorted values back to a text file with pyspark

Question

I am trying to sort a text file, input.txt, that contains records in the following format:

b1 xy
a2 pq

This is my pySpark code:

distFile = sc.textFile("input.txt")
words = distFile.map(lambda x: [x[:2],x[2:]])
words.saveAsTextFile("output")

This is the output I get:

output/part-00000

[u'a2', u'pq']
[u'b1', u'xy']

What I want is:

a2 pq
b1 xy

What am I doing wrong?

I also get garbage values when I use words.saveAsPickleFile("output").

Answer 1

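saveAsTextFile writes the string representation of each element, so a list is written as [u'a2', u'pq']. You need to join the strings of each record into a single string first. Something like: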

rdd = sc.parallelize([("Roger", "Andrew"),
                      ("Melissa", "Goldsmith")])

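# Index into the tuple; tuple-unpacking lambdas such as lambda (n, ln) only work in Python 2
words = rdd.map(lambda pair: pair[0] + " " + pair[1])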

words.repartition(1).saveAsTextFile("output")

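The result is a single part file in which each record is a plain line such as Roger Andrew, rather than the printed form of a list.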
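
Applied to the data in the question, the same idea could look roughly like the sketch below. It is not the asker's original code: the helper names split_record, format_record and sort_file are made up for illustration, and it assumes every non-blank line holds a key and a value separated by whitespace, and that sc is an already created SparkContext.

```python
def split_record(line):
    """Split a line such as 'b1 xy' into a (key, value) pair.

    Assumes the key and value are separated by whitespace.
    """
    key, value = line.split(None, 1)
    return key, value.strip()


def format_record(pair):
    """Join a (key, value) pair back into one plain-text line."""
    return pair[0] + " " + pair[1]


def sort_file(sc, input_path, output_path):
    """Read lines, sort them by key, and save them as plain text.

    `sc` is an existing SparkContext supplied by the caller.
    """
    (sc.textFile(input_path)
       .filter(lambda line: line.strip())  # drop blank lines
       .map(split_record)                  # 'b1 xy' -> ('b1', 'xy')
       .sortByKey()                        # sort pairs by their key
       .map(format_record)                 # ('a2', 'pq') -> 'a2 pq'
       .saveAsTextFile(output_path))
```

Calling sort_file(sc, "input.txt", "output") should then write lines like a2 pq instead of [u'a2', u'pq']. If a single sorted part file is wanted, passing numPartitions=1 to sortByKey is probably safer than calling repartition(1) afterwards, since repartition shuffles the data and may not keep the sorted order.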
