Spark using Python: save RDD output into text files

Question

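I am trying the word count problem in Spark using Python, but I run into a problem when I try to save the output RDD to a text file with the .saveAsTextFile command. Here is my code. Please help me, I am stuck. Thank you for your time.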

import re

from pyspark import SparkConf , SparkContext

def normalizewords(text):
    return re.compile(r'\W+',re.UNICODE).split(text.lower())

conf=SparkConf().setMaster("local[2]").setAppName("sorted result")
sc=SparkContext(conf=conf)

input=sc.textFile("file:///home/cloudera/PythonTask/sample.txt")

words=input.flatMap(normalizewords)

wordsCount=words.map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y)

sortedwordsCount=wordsCount.map(lambda (x,y):(y,x)).sortByKey()

results=sortedwordsCount.collect()

for result in results:
    count=str(result[0])
    word=result[1].encode('ascii','ignore')

    if(word):
        print word +"\t\t"+ count

results.saveAsTextFile("/var/www/myoutput")

Answer 1

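Because you called results=sortedwordsCount.collect(), results is no longer an RDD. It is an ordinary Python list (here, a list of (count, word) tuples).

As you know, a list is a Python object/data structure, and append is its method for adding an element.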

>>> x = []
>>> x.append(5)
>>> x
[5]

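Similarly, an RDD is Spark's object/data structure, and saveAsTextFile is its method for writing the data out to files. The important difference is that an RDD is a distributed data structure.

So we cannot use append on an RDD, or saveAsTextFile on a list. collect is the RDD method that brings the contents of the RDD into the driver's memory.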

As mentioned in the comments, either save sortedwordsCount with saveAsTextFile, or open a file in Python and write results to it.
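
The second option, writing the already collected list from plain Python, might look like the sketch below. The helper name write_results, the sample data, and the output file name are made up for illustration; in the question's script, results would come from sortedwordsCount.collect(). No Spark call is involved at this point, because the data is already an ordinary list on the driver.

```python
import io
import os
import tempfile


def write_results(results, path):
    """Write (count, word) pairs, as returned by collect(), to one local text file."""
    with io.open(path, "w", encoding="utf-8") as out:
        for count, word in results:
            if word:  # skip the empty token that the regex split can produce
                out.write(u"{0}\t{1}\n".format(word, count))


# Hypothetical stand-in for sortedwordsCount.collect()
sample_results = [(1, u"spark"), (2, u"python"), (3, u"")]

# Hypothetical output location; any writable local path would do
output_path = os.path.join(tempfile.gettempdir(), "wordcount_output.txt")
write_results(sample_results, output_path)
```

This only makes sense when the collected result fits comfortably in the driver's memory, which is usually the case for a small word count.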

Answer 2

Change results=sortedwordsCount.collect() to results=sortedwordsCount, because with .collect() the result is a list.
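
A minimal sketch of saving the RDD itself, without collect(), could look like this. The helper names format_pair and save_sorted_counts are hypothetical, and the function assumes it is handed an RDD of (count, word) tuples such as sortedwordsCount from the question.

```python
def format_pair(pair):
    """Turn one (count, word) tuple into a tab-separated output line."""
    count, word = pair
    return u"{0}\t{1}".format(word, count)


def save_sorted_counts(sorted_rdd, output_dir):
    """Save the RDD directly; saveAsTextFile is an RDD method, not a list method."""
    (sorted_rdd
        .filter(lambda pair: pair[1])  # drop pairs whose word is empty
        .map(format_pair)              # one formatted string per record
        .saveAsTextFile(output_dir))   # Spark writes part-files into this directory
```

In the question's script this would be called as save_sorted_counts(sortedwordsCount, "/var/www/myoutput"). Note that saveAsTextFile creates a directory of part-files rather than a single file, and it fails if the output directory already exists. A path without a scheme is resolved against the cluster's default filesystem, so a file:/// prefix may be needed to write to local disk.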

