PySpark in Jupyter Notebook: 'Column' object is not callable

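I am analyzing data on Olympic performance and want to create an overview of which athletes have won the most medals. First, I created additional columns, because in the original dataset the medal won is represented by a string ("Gold", "Silver", etc.) or NA.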

from pyspark.sql.functions import col, count, when

totalDF = olympicDF.count()
medalswonDF = olympicDF\
    .where(col("Medal") != "NA")\
    .withColumn("Gold", when(col("Medal") == "Gold", "1"))\
    .withColumn("Silver", when(col("Medal") == "Silver", "1"))\
    .withColumn("Bronze", when(col("Medal") == "Bronze", "1"))\
    .withColumn("Total", when(col("Medal") != "NA", "1"))  # the "1" is just a placeholder for now

Next, I want to display a table of the 25 most successful athletes (by number of medals won).

medalswonDF.cache() # optimization to make the processing faster

medalswonDF.where(col("Medal")!="NA")\
                     .select("Name", "Gold", "Silver", "Bronze")\
                     .groupBy("Name")\
                     .agg(count("Gold")),\
                          (count("Silver")),\
                            (count("Bronze"))\
.orderBy("Gold").desc()\
.select("Name", "Gold", "Silver", "Bronze").show(25,True)

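However, I keep getting the error "TypeError: 'Column' object is not callable". I know this happens when you try to apply a function that cannot be applied to a column, but as far as I understand, that should not be the cause here.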

Schema for reference:

root
 |-- ID: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Height: integer (nullable = true)
 |-- Weight: integer (nullable = true)
 |-- Team: string (nullable = true)
 |-- NOC: string (nullable = true)
 |-- Games: string (nullable = true)
 |-- Year: string (nullable = true)
 |-- Season: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Sport: string (nullable = true)
 |-- Event: string (nullable = true)
 |-- Medal: string (nullable = true)
 |-- Gold: string (nullable = true)
 |-- Silver: string (nullable = true)
 |-- Bronze: string (nullable = true)
 |-- Total: string (nullable = true)

What am I doing wrong?

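You closed agg with an extra parenthesis too early: the ")" right after count("Gold") ends the agg call, so count("Silver") and count("Bronze") are no longer arguments to it.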
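The exact error message can be reproduced without Spark. The following is a minimal sketch, assuming a hypothetical FakeColumn class and a hypothetical count helper that only mimic one behavior of PySpark's Column: looking up an unknown attribute is treated as a nested-field access and returns another column instead of raising AttributeError. Once agg is closed early, .orderBy(...) ends up attached to count("Bronze"), which is a column, so the "method" being called is itself a column.

```python
class FakeColumn:
    """Hypothetical stand-in for pyspark.sql.Column (not the real class)."""

    def __init__(self, name):
        self.name = name

    def __getattr__(self, item):
        # Only invoked for attributes that are not found normally.
        # Mimics Column: treat the name as a nested-field lookup and
        # return another column rather than raising AttributeError.
        if item.startswith("__"):
            raise AttributeError(item)
        return FakeColumn(f"{self.name}.{item}")


def count(name):
    """Hypothetical stand-in for pyspark.sql.functions.count."""
    return FakeColumn(f"count({name})")


def try_order_by():
    # Mirrors the tail of the failing expression: count("Bronze").orderBy("Gold")
    try:
        count("Bronze").orderBy("Gold")
    except TypeError as exc:
        return str(exc)
    return "no error"


message = try_order_by()
print(message)
```

With the real Column class the message names 'Column' instead of 'FakeColumn', which matches the error in the question.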

Change the code as shown below:

(medalswonDF
    .where(col("Medal") != "NA")
    .select("Name", "Gold", "Silver", "Bronze")
    .groupBy("Name")
    .agg(count("Gold").alias("Gold_count"),
         count("Silver").alias("Silver_count"),
         count("Bronze").alias("Bronze_count"))
    # desc() is a Column method, so it belongs inside orderBy()
    .orderBy(col("Gold_count").desc())
    .select("Name", "Gold_count", "Silver_count", "Bronze_count")
    .show(25, True))
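The aggregation works with the "1" placeholders because when() without otherwise() yields null for non-matching rows, and count() on a column counts only non-null values. A plain-Python sketch of that logic, assuming made-up sample rows (the athlete names are hypothetical) and using None in place of Spark's null:

```python
from collections import defaultdict

# Hypothetical sample rows: (Name, Gold, Silver, Bronze).
# None plays the role of Spark's null.
rows = [
    ("Athlete A", "1", None, None),
    ("Athlete A", "1", None, None),
    ("Athlete A", None, None, "1"),
    ("Athlete B", None, "1", None),
]


def count_non_null_by_name(rows):
    """Emulates groupBy("Name").agg(count(...)) for the three medal columns."""
    counts = defaultdict(lambda: [0, 0, 0])
    for name, *medals in rows:
        for i, value in enumerate(medals):
            if value is not None:  # nulls are skipped, as with count(column)
                counts[name][i] += 1
    return {name: tuple(c) for name, c in counts.items()}


def top_by_gold(counts, n):
    """Emulates ordering by the gold count, descending, and taking n rows."""
    ordered = sorted(counts.items(), key=lambda kv: kv[1][0], reverse=True)
    return ordered[:n]


result = top_by_gold(count_non_null_by_name(rows), 25)
print(result)
```

This only illustrates the counting and ordering semantics; it is not a replacement for the Spark query.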