Storing the content from a for loop in a list in Python

Question

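This is a Python program written in a PySpark IPython notebook. I am trying to use a for loop to count, in each RDD (which can be thought of as a file), the occurrences of every word given in the list 'names'. I want to store the per-file counts for each word in a list named after that word.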

For example, suppose the count of the word harry is 1214 in the first RDD, 1506 in the second RDD, and so on. I would like to create a list harrylist = [1214, 1506, 1825, 2933, 3748, 2617, 2887]

The list of names is dynamic.

names = ['harry', 'hermione', 'ron', 'hagrid']
rdds = [hp1RDD, hp2RDD, hp3RDD, hp4RDD, hp5RDD, hp6RDD, hp7RDD]

for n in names:
    a = []  # counts of the word n, one entry per RDD

    for x in rdds:
        a.append(x.flatMap(lambda line: line.split(" ")).filter(lambda word: word == n).count())

    print(a)

With the code above I can print the contents of each list, but I cannot save them in the way shown above.

Answer 1

If you don't mind that:

words like hagrid's are counted separately from hagrid

then collections.Counter will help:

from collections import Counter

hp1RDD = "harry potter has a girlfriend who's name is hermione granger and a friend called ron. harry has an uncle who's name is hagrid. hagrid is a big guy"
hp2RDD = "harry potter is the best movie I've ever saw. hermione is very beautfiful"

names = ['harry', 'hermione','ron','hagrid']
rdds = [hp1RDD, hp2RDD]
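results = {}  # maps each name to its list of counts, one per text

for name in names:
    tmp_list = []

    for rdd in rdds:
        # Counter maps each whitespace-separated token to its frequency
        count = Counter(rdd.split())
        tmp_list.append(count[name])
    results[name] = tmp_list

print(results)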
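Since the question works with RDDs rather than plain strings, here is a minimal sketch of the same dictionary-of-lists idea written against RDD-style calls (flatMap, filter, countByValue). The names count_names and LocalLines are hypothetical and introduced only for illustration. LocalLines is an in-memory stand-in so the sketch runs without Spark, and it is an assumption that real RDDs of text lines can be passed in its place.

```python
from collections import Counter


class LocalLines:
    """Hypothetical in-memory stand-in exposing the three RDD methods used below."""

    def __init__(self, items):
        self.items = list(items)

    def flatMap(self, f):
        return LocalLines(y for x in self.items for y in f(x))

    def filter(self, f):
        return LocalLines(x for x in self.items if f(x))

    def countByValue(self):
        return Counter(self.items)


def count_names(rdds, names):
    """Return {name: [count in first rdd, count in second rdd, ...]}."""
    wanted = set(names)
    results = {name: [] for name in names}
    for rdd in rdds:
        # One pass per RDD: split lines into words, keep only the wanted names,
        # then count how often each remaining word occurs.
        counts = (rdd.flatMap(lambda line: line.split(" "))
                     .filter(lambda word: word in wanted)
                     .countByValue())
        for name in names:
            results[name].append(counts.get(name, 0))
    return results


# Tiny made-up input, assumed only for this example.
books = [LocalLines(["harry met ron", "harry left"]),
         LocalLines(["hermione and harry"])]
results = count_names(books, ["harry", "ron"])
print(results["harry"])  # [2, 1]
```

Keying the dictionary by name gives one list per word, so results['harry'] is the per-name list the question asks for, and a dynamic list of names needs no separately named variables.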

You can also make the count case-insensitive by applying lower():

count = Counter([x.lower() for x in rdd.split()])
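If the punctuation caveat does matter, one possible variation (a sketch, not part of the approach above) is to tokenize with a regular expression instead of split(), so that "ron." and "Hagrid's" also count toward ron and hagrid. The function name count_names_in_texts and the sample texts are assumptions made for this example.

```python
import re
from collections import Counter


def count_names_in_texts(texts, names):
    """Return {name: [count in first text, count in second text, ...]}."""
    results = {name: [] for name in names}
    for text in texts:
        # Lowercase first, then keep only runs of letters; this drops
        # punctuation and splits "hagrid's" into "hagrid" and "s".
        words = re.findall(r"[a-z]+", text.lower())
        counts = Counter(words)  # built once per text, reused for every name
        for name in names:
            results[name].append(counts[name])
    return results


# Made-up sample texts, assumed only for this example.
texts = ["Harry has an uncle named Hagrid. Hagrid's friend is Ron.",
         "harry and hermione"]
results = count_names_in_texts(texts, ["harry", "hagrid", "ron"])
print(results)  # {'harry': [1, 1], 'hagrid': [2, 0], 'ron': [1, 0]}
```

The pattern [a-z]+ only recognizes unaccented ASCII letters, which is enough for these names but would need widening for other text.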

