PySpark in Jupyter Notebook: 'Column' object is not callable
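I am running an analysis of data on Olympic performance and want to create an overview of which athletes won the most medals. First, I created additional columns, because in the original dataset the medals won are represented by a string ("Gold", "Silver", etc.) or NA.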
from pyspark.sql.functions import col, count, when

totalDF = olympicDF.count()
medalswonDF = olympicDF\
    .where(col("Medal") != "NA")\
    .withColumn("Gold", when(col("Medal") == "Gold", ("1")))\
    .withColumn("Silver", when(col("Medal") == "Silver", ("1")))\
    .withColumn("Bronze", when(col("Medal") == "Bronze", ("1")))\
    .withColumn("Total", when(col("Medal") != "NA", ("1")))  # the "1" is just a placeholder for now
As the next step, I want to display a table of the 25 most successful athletes (by number of medals won):
medalswonDF.cache() # optimization to make the processing faster
medalswonDF.where(col("Medal")!="NA")\
.select("Name", "Gold", "Silver", "Bronze")\
.groupBy("Name")\
.agg(count("Gold")),\
(count("Silver")),\
(count("Bronze"))\
.orderBy("Gold").desc()\
.select("Name", "Gold", "Silver", "Bronze").show(25,True)
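However, I keep getting the error message "TypeError: 'Column' object is not callable". I know this happens when you try to apply a function that cannot be applied to a column, but as far as I understand, that should not be the cause here.
Schema for reference: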
root
|-- ID: integer (nullable = true)
|-- Name: string (nullable = true)
|-- Sex: string (nullable = true)
|-- Age: integer (nullable = true)
|-- Height: integer (nullable = true)
|-- Weight: integer (nullable = true)
|-- Team: string (nullable = true)
|-- NOC: string (nullable = true)
|-- Games: string (nullable = true)
|-- Year: string (nullable = true)
|-- Season: string (nullable = true)
|-- City: string (nullable = true)
|-- Sport: string (nullable = true)
|-- Event: string (nullable = true)
|-- Medal: string (nullable = true)
|-- Gold: string (nullable = true)
|-- Silver: string (nullable = true)
|-- Bronze: string (nullable = true)
|-- Total: string (nullable = true)
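What am I doing wrong?
You closed the agg call too early: the extra closing parenthesis after count("Gold") ends agg before the other two counts are passed to it.
Change the code as shown below: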
# All three counts go inside a single agg(...) call.
# desc() is a Column method, not a DataFrame method, so it is applied inside orderBy.
medalswonDF.where(col("Medal") != "NA")\
    .select("Name", "Gold", "Silver", "Bronze")\
    .groupBy("Name")\
    .agg(count("Gold").alias("Gold_count"),
         count("Silver").alias("Silver_count"),
         count("Bronze").alias("Bronze_count"))\
    .orderBy(col("Gold_count").desc())\
    .select("Name", "Gold_count", "Silver_count", "Bronze_count").show(25, True)
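To see why the misplaced parenthesis surfaces as "'Column' object is not callable" rather than a syntax error, here is a small pure-Python sketch that runs without Spark. FakeColumn and this count are hypothetical stand-ins, not the real PySpark classes; the only behavior borrowed from PySpark is that a Column treats an unknown attribute name as a nested-field lookup and returns another column instead of raising AttributeError.

```python
class FakeColumn:
    """Hypothetical stand-in for pyspark.sql.Column (not the real class)."""

    def __init__(self, name):
        self.name = name

    def __getattr__(self, item):
        # Only invoked for attributes that do not exist on the object.
        # Like PySpark's Column, treat the unknown name as a field lookup
        # and return another column instead of raising AttributeError.
        if item.startswith("__"):
            raise AttributeError(item)
        return FakeColumn(f"{self.name}.{item}")


def count(name):
    """Stand-in for pyspark.sql.functions.count; it only wraps the name."""
    return FakeColumn(f"count({name})")


# With agg(...) closed after the first count, the remaining lines no longer
# act as arguments: the comma-separated expressions form a plain tuple.
parsed = (count("Gold")), (count("Silver")), (count("Bronze"))
print(type(parsed).__name__, len(parsed))

# The chained .orderBy then binds to the last column, not to a DataFrame.
# The attribute lookup succeeds (it yields a column), and calling it fails.
try:
    count("Bronze").orderBy("Gold")
    message = None
except TypeError as exc:
    message = str(exc)
print(message)
```

In the original query the same thing happens to count("Bronze"): .orderBy is resolved on the column rather than on the grouped DataFrame, and the call on the result raises the TypeError.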