Why can agg() in PySpark summarize only one column at a time?
For the DataFrame below:
df=spark.createDataFrame(data=[('Alice',4.300),('Bob',7.677)],schema=['name','High'])
When I try to find both the minimum and the maximum, the output contains only the minimum.
df.agg({'High':'max','High':'min'}).show()
+-----------+
|min(High) |
+-----------+
| 2094900|
+-----------+
Why can't agg() return both the max and the min, as Pandas does?
As you can see in the documentation:
agg(*exprs)
Compute aggregates and returns the result as a DataFrame.
The available aggregate functions are avg, max, min, sum, count.
If exprs is a single dict mapping from string to string, then the key is the column to perform aggregation on, and the value is the aggregate function.
Alternatively, exprs can also be a list of aggregate Column expressions.
Parameters: exprs – a dict mapping from column name (string) to aggregate functions (string), or a list of Column.
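The dict form also explains the output in the question: Python collapses the duplicate key before agg() ever receives the argument, so only one entry for 'High' survives. A minimal plain-Python sketch (no Spark needed; the variable name exprs is only for illustration):

```python
# A dict literal cannot hold the same key twice: the last value wins.
exprs = {'High': 'max', 'High': 'min'}

print(exprs)       # {'High': 'min'}
print(len(exprs))  # 1
```

So the dict form can carry at most one aggregate function per column.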
You can pass a list of Column expressions instead and apply whichever function you need to each column, like this:
>>> from pyspark.sql import functions as F
>>> df.agg(F.min(df.High),F.max(df.High),F.avg(df.High),F.sum(df.High)).show()
+---------+---------+---------+---------+
|min(High)|max(High)|avg(High)|sum(High)|
+---------+---------+---------+---------+
| 4.3| 7.677| 5.9885| 11.977|
+---------+---------+---------+---------+