How can I obtain percentage frequencies in PySpark?
I am trying to get percentage frequencies in PySpark. In Python (pandas) I did it as follows:
Companies = df['Company'].value_counts(normalize = True)
Getting the raw frequencies is straightforward:
# Companies in descending order of complaint frequency
df.createOrReplaceTempView('Comp')
CompDF = spark.sql("SELECT Company, count(*) as cnt \
FROM Comp \
GROUP BY Company \
ORDER BY cnt DESC")
CompDF.show()
+--------------------+----+
|             Company| cnt|
+--------------------+----+
|BANK OF AMERICA, ...|1387|
|       EQUIFAX, INC.|1285|
|WELLS FARGO & COM...|1119|
|Experian Informat...|1115|
|TRANSUNION INTERM...|1001|
|JPMORGAN CHASE & CO.| 905|
|      CITIBANK, N.A.| 772|
|OCWEN LOAN SERVIC...| 481|
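How can I get percentage frequencies from here? I have tried many things without much luck.
Any help would be appreciated.
Modifying the SQL query may give you the result you want.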
"SELECT Company,cnt/(SELECT SUM(cnt) from (SELECT Company, count(*) as cnt
FROM Comp GROUP BY Company ORDER BY cnt DESC) temp_tab) sum_freq from
(SELECT Company, count(*) as cnt FROM Comp GROUP BY Company ORDER BY cnt
DESC)"
As Suresh hinted in the comments, assuming total_count is the number of rows in the dataframe Companies, you can use withColumn to add a new column named percentage to CompDF:
total_count = Companies.count()
df = CompDF.withColumn('percentage', CompDF.cnt/float(total_count))