How to count occurrences in PySpark?

I have a long list of titles and want to count how often each title appears across the whole dataset. For example:

`title`

   A
   b
   A
   c
   c
   c

Output:

 title freq
     A    2
     b    1
     c    3

Hi, with pandas you can do this:

 import pandas as pd
 title = ["A", "b", "A", "c", "c", "c"]
 pd.Series(title).value_counts()
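For reference, here is that snippet as a self-contained check; `value_counts` sorts the result by frequency, descending:

```python
# Runnable check of the pandas approach, using the same data as the question.
import pandas as pd

title = ["A", "b", "A", "c", "c", "c"]
counts = pd.Series(title).value_counts()  # sorted by frequency, descending
print(counts.to_dict())  # {'c': 3, 'A': 2, 'b': 1}
```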

You can `groupBy` on `title` and then `count`:

import pyspark.sql.functions as f
df.groupBy('title').agg(f.count('*').alias('count')).show()
+-----+-----+
|title|count|
+-----+-----+
|    A|    2|
|    c|    3|
|    b|    1|
+-----+-----+

Or, more concisely:

df.groupBy('title').count().show()

+-----+-----+
|title|count|
+-----+-----+
|    A|    2|
|    c|    3|
|    b|    1|
+-----+-----+