Counting instances of values across lists within a single column

Question

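I have a PySpark DataFrame in which one column consists of lists of strings. I would like to count the number of instances of each element across the string lists of all rows. Pseudocode: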

counter = Counter()
for attr_list in df['attr_list']:
   counter.update(attr_list)

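Another way to do this would be to concatenate the lists from all rows and build a Counter from the single giant list. Is there an efficient way to do this in PySpark?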

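The correct output would be a single collections.Counter() object populated with the number of occurrences of each item across all the lists in the column, i.e. if for a given column row 1 has the list ['a', 'b', 'c'] and row 2 has the list ['b', 'c', 'd'], we would get a Counter that looks like {'a': 1, 'b': 2, 'c': 2, 'd': 1}.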
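
For reference, the intended result can be reproduced locally in plain Python for the two example rows. This is only a sketch of the expected output, not a Spark solution; the variable name rows is a stand-in for the column's contents.

```python
from collections import Counter

# Stand-in for the column's contents: one list of strings per row.
rows = [["a", "b", "c"], ["b", "c", "d"]]

counter = Counter()
for attr_list in rows:
    counter.update(attr_list)  # adds 1 for every element in the list

print(dict(counter))  # {'a': 1, 'b': 2, 'c': 2, 'd': 1}
```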

Answer 1

You can try the RDD's flatMap and reduceByKey methods: just convert the column to an RDD and apply these operations.

from collections import Counter

counter = Counter(
    df.select("attr_list")
      .rdd
      # each Row holds one list; flatten so every string becomes its own record
      .flatMap(lambda row: row["attr_list"])
      # make a tuple for each string so it can be grouped by key to get its frequency
      .map(lambda x: (x, 1))
      .reduceByKey(lambda a, b: a + b)
      # bring the (string, count) pairs back to the driver as a dict
      .collectAsMap()
)

Answer 2

One option is to convert to an RDD, merge all the arrays into one, and then use a Counter object on it.

from collections import Counter
all_lists = df.select('listCol').rdd
print(Counter(all_lists.map(lambda x: [i for i in x[0]]).reduce(lambda x,y: x+y)))
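
Concatenating every list before counting can get expensive on large data, so a possible variant is to count inside each partition and merge the partial Counters. This is a different technique from the one above, and only a rough sketch: the helper names count_partition, merge_counters and count_list_column are hypothetical, and it assumes the column contains no null lists.

```python
from collections import Counter
from itertools import chain


def count_partition(rows):
    """Count the list elements of one partition and yield a single Counter."""
    # Each row is a tuple-like Row whose first field is the list of strings.
    yield Counter(chain.from_iterable(row[0] for row in rows))


def merge_counters(left, right):
    """Merge two partial Counters, reusing the left one."""
    left.update(right)
    return left


def count_list_column(df, col_name):
    """Hypothetical helper: one Counter over every list in df[col_name]."""
    return (df.select(col_name)
              .rdd
              .mapPartitions(count_partition)
              .reduce(merge_counters))


# Local check of the two pure-Python pieces, no Spark needed:
partition_1 = [(["a", "b", "c"],)]
partition_2 = [(["b", "c", "d"],)]
partials = [next(count_partition(iter(p))) for p in (partition_1, partition_2)]
total = merge_counters(partials[0], partials[1])
print(dict(total))
```

With a real DataFrame the call would look like count_list_column(df, 'listCol'); only one small Counter per partition travels back to the driver.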

Another option uses explode and groupBy, then merges the result into a dictionary.

from pyspark.sql.functions import explode

# one output row per list element
explode_df = df.withColumn('exploded_list', explode(df.listCol))
counts = explode_df.groupBy('exploded_list').count()
# collect the (element, count) rows and build a dict from them
print({row['exploded_list']: row['count'] for row in counts.collect()})
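
Since the question asks for a collections.Counter, the (element, count) rows produced by the groupBy can be wrapped into one. A small sketch, assuming counts is the grouped DataFrame from above; the helper name rows_to_counter is hypothetical, and the sample rows stand in for counts.collect().

```python
from collections import Counter


def rows_to_counter(rows):
    """Build a Counter from an iterable of (element, count) pairs.

    PySpark Row objects are tuples, so the list returned by
    counts.collect() can be passed in directly.
    """
    return Counter({element: count for element, count in rows})


# Local stand-in for counts.collect(); with Spark this would be
# rows_to_counter(counts.collect()).
sample_rows = [("a", 1), ("b", 2), ("c", 2), ("d", 1)]
counter = rows_to_counter(sample_rows)
print(counter.most_common(2))
```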

Answer 3

If you know the elements that you have to count, you can use this with Spark 2.4+ and it will be very fast (it uses the higher-order function filter and structs).

df.show()
#+------------+
#|    atr_list|
#+------------+
#|[a, b, b, c]|
#|   [b, c, d]|
#+------------+

elements=['a','b','c','d']

from pyspark.sql import functions as F
collected=df.withColumn("struct", F.struct(*[(F.struct(F.expr("size(filter(atr_list,x->x={}))"\
                .format("'"+y+"'"))).alias(y)) for y in elements]))\
            .select(*[F.sum(F.col("struct.{}.col1".format(x))).alias(x) for x in elements])\
            .collect()[0]

{elements[i]: [x for x in collected][i] for i in range(len(elements))}

Out: {'a': 1, 'b': 3, 'c': 2, 'd': 1}

Second method, using transform, aggregate, explode and groupBy (no need to specify the elements):

from pyspark.sql import functions as F
a=df.withColumn("atr", F.expr("""transform(array_distinct(atr_list),x->aggregate(atr_list,0,(acc,y)->\
                               IF(y=x, acc+1,acc)))"""))\
    .withColumn("zip", F.explode(F.arrays_zip(F.array_distinct("atr_list"),("atr"))))\
    .select("zip.*").withColumnRenamed("0","elements")\
    .groupBy("elements").agg(F.sum("atr").alias("sum"))\
    .collect()

{a[i][0]: a[i][1] for i in range(len(a))}
