Using PySpark to Count Number of Occurrences
I have a pair RDD that contains document IDs as keys and the list of words in each document as values. For example:

| DocID | Words |
|---|---|
| 001 | ["quick","brown","fox","lazy","fox"] |
| 002 | ["banana","apple","apple","banana","fox"] |
I managed to do a mapValues to get this:

| DocID | Words |
|---|---|
| 001 | [("quick",1),("brown",1),("fox",1),("lazy",1),("fox",1)] |
| 002 | [("banana",1),("apple",1),("apple",1),("banana",1),("fox",1)] |
Is there a way to run reduceByKey() on just the words, so that I end up with this?

| DocID | Words |
|---|---|
| 001 | [("quick",1),("brown",1),("fox",2),("lazy",1)] |
| 002 | [("banana",2),("apple",2),("fox",1)] |
I still need to keep the (DocID, list) structure, so that the counting is applied within each document only.
You can use collections.Counter to count the words in each document:
from collections import Counter

rdd = sc.parallelize([
    ("001", ["quick","brown","fox","lazy","fox"]),
    ("002", ["banana","apple","apple","banana","fox"])
])

# Counter(x).items() yields the (word, count) pairs directly; the counting
# happens inside each value, so the per-document structure is preserved.
counted = rdd.mapValues(lambda x: list(Counter(x).items()))

counted.collect()
# [('001', [('quick', 1), ('brown', 1), ('fox', 2), ('lazy', 1)]),
#  ('002', [('banana', 2), ('apple', 2), ('fox', 1)])]
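
As a minimal variation on the above: if you also want each document's list ordered by frequency, Counter.most_common() returns the same (word, count) pairs sorted by descending count (ties keep first-encountered order on Python 3.7+):

# Same per-document counts, but sorted by descending count.
by_freq = rdd.mapValues(lambda ws: Counter(ws).most_common())

by_freq.collect()
# [('001', [('fox', 2), ('quick', 1), ('brown', 1), ('lazy', 1)]),
#  ('002', [('banana', 2), ('apple', 2), ('fox', 1)])]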
Another way, using only RDD operations:
from operator import add

result = (rdd.flatMapValues(lambda x: x)                  # (doc, word)
             .map(lambda x: (x, 1))                       # ((doc, word), 1)
             .reduceByKey(add)                            # ((doc, word), count)
             .map(lambda x: (x[0][0], [(x[0][1], x[1])])) # (doc, [(word, count)])
             .reduceByKey(add))                           # concatenate lists per doc

result.collect()
# [('002', [('banana', 2), ('apple', 2), ('fox', 1)]),
#  ('001', [('brown', 1), ('fox', 2), ('lazy', 1), ('quick', 1)])]
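
The trick in this version is the compound key: after flatMapValues, each record is keyed by the (DocID, word) pair, so the first reduceByKey counts within each document independently; the second reduceByKey merely concatenates the one-element lists back together. Note that neither document order nor word order is guaranteed after the shuffle, so sort if you need deterministic output.

For completeness, a rough DataFrame equivalent (a sketch, assuming an active SparkSession; the column names doc_id, word, and word_counts are illustrative, not from the original post):

from pyspark.sql import functions as F

df = rdd.toDF(["doc_id", "words"])  # assumes an active SparkSession

result_df = (df.select("doc_id", F.explode("words").alias("word"))  # one row per (doc, word)
               .groupBy("doc_id", "word").count()                   # count within each document
               .groupBy("doc_id")
               .agg(F.collect_list(F.struct("word", "count")).alias("word_counts")))

result_df.show(truncate=False)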