Correlated subquery on grouped expression - TreeNodeException: Binding attribute, tree: count(1)#382L
Suppose I'm trying to compute some statistics on sample data consisting of pairs (a and b values). Some pairs exist multiple times, others don't exist at all.
from pyspark.sql import Row

spark.createDataFrame([
Row(a=5, b=10), Row(a=5, b=10), Row(a=5, b=10),
Row(a=6, b=10), Row(a=6, b=10), Row(a=6, b=10), Row(a=6, b=10), Row(a=6, b=10), Row(a=6, b=10),
Row(a=5, b=11), Row(a=5, b=11),
Row(a=6, b=12), Row(a=6, b=12), Row(a=6, b=12), Row(a=6, b=12),
Row(a=5, b=5), Row(a=5, b=5), Row(a=5, b=5), Row(a=5, b=5), Row(a=5, b=5), Row(a=5, b=5), Row(a=5, b=5),
]).registerTempTable('mydata')
First, I simply count how often each pair occurs:
spark.sql('''
SELECT a, b,
COUNT(*) as count
FROM mydata AS o
GROUP BY a, b
''').show()
Output:
+---+---+-----+
| a| b|count|
+---+---+-----+
| 6| 12| 4|
| 5| 5| 7|
| 6| 10| 6|
| 5| 10| 3|
| 5| 11| 2|
+---+---+-----+
Now I want to add an extra column containing the percentage: how often a pair occurs relative to the total number of pairs with the same a value. To compute that total, I tried adding a correlated subquery:
spark.sql('''
SELECT a, b,
COUNT(*) as count,
(COUNT(*) / (
SELECT COUNT(*) FROM mydata AS i WHERE o.a = i.a
)) as percentage
FROM mydata AS o
GROUP BY a, b
''').show()
What I expected:
+---+---+-----+----------+
| a| b|count|percentage|
+---+---+-----+----------+
| 6| 12| 4| 0.4| --> 10 pairs exist with a=6 --> 4/10 = 0.4
|  5|  5|    7|    0.5833| --> 12 pairs exist with a=5 --> 7/12 = 0.5833
| 6| 10| 6| 0.6| --> ...
| 5| 10| 3| 0.25|
| 5| 11| 2| 0.1666|
+---+---+-----+----------+
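As a quick sanity check on those denominators (this aggregation is not part of the failing attempt, just a verification of the per-a totals implied by the sample data):

spark.sql('''
    SELECT a, COUNT(*) AS total
    FROM mydata
    GROUP BY a
''').show()
# Per the sample data above this should give: a=5 -> 12 rows, a=6 -> 10 rows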
What I got instead:
py4j.protocol.Py4JJavaError: An error occurred while calling o371.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: count(1)#382L
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference.applyOrElse(BoundAttribute.scala:91)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference.applyOrElse(BoundAttribute.scala:90)
[...]
Caused by: java.lang.RuntimeException: Couldn't find count(1)#382L in [a#305L,b#306L,count(1)#379L]
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$$anonfun$applyOrElse.apply(BoundAttribute.scala:97)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$$anonfun$applyOrElse.apply(BoundAttribute.scala:91)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
... 80 more
This is somewhat confusing: PySpark apparently wants to access the count of the inner query somehow? Is there something wrong with my subquery syntax?
The error looks like a limitation in how Spark binds the outer aggregate when a correlated scalar subquery appears in the same SELECT, so it is easiest to sidestep the subquery entirely. Starting from your first table, you can compute the percentage with a window function: sum(count) over (partition by a) sums count within each partition of a without reducing the number of rows, which lets you divide the count column by it directly:
spark.sql('''
SELECT a, b,
COUNT(*) as count
FROM mydata AS o
GROUP BY a, b
''').registerTempTable('count')
spark.sql('''
SELECT *,
count / sum(count) over (partition by a) as percentage
FROM count
''').show()
+---+---+-----+-------------------+
| a| b|count| percentage|
+---+---+-----+-------------------+
| 6| 12| 4| 0.4|
| 6| 10| 6| 0.6|
| 5| 5| 7| 0.5833333333333334|
| 5| 10| 3| 0.25|
| 5| 11| 2|0.16666666666666666|
+---+---+-----+-------------------+
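The two statements can also be combined into a single query by nesting the grouped aggregation as a subquery (a minimal sketch; the alias t is only required by some Spark versions):

spark.sql('''
    SELECT a, b, count,
           count / SUM(count) OVER (PARTITION BY a) AS percentage
    FROM (
        SELECT a, b, COUNT(*) AS count
        FROM mydata
        GROUP BY a, b
    ) AS t
''').show()

For completeness, the same computation in the DataFrame API, assuming the mydata temp table from the question is registered (a sketch, not part of the original answer):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# groupBy().count() yields a column literally named `count`
counts = spark.table('mydata').groupBy('a', 'b').count()

# Divide each group's count by the sum of counts within its `a` partition
w = Window.partitionBy('a')
counts.withColumn('percentage', F.col('count') / F.sum('count').over(w)).show()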