MySQL sum over a window that contains a null value returns null
I'm trying to get the sum of Revenue over the previous 3 months' rows (excluding the current row) for each client. Minimal example of what I'm currently trying in Databricks:
import pandas as pd
import numpy as np

cols = ['Client','Month','Revenue']
df_pd = pd.DataFrame([['A',201701,100],
['A',201702,101],
['A',201703,102],
['A',201704,103],
['A',201705,104],
['B',201701,201],
['B',201702,np.nan],
['B',201703,203],
['B',201704,204],
['B',201705,205],
['B',201706,206],
['B',201707,207]
])
df_pd.columns = cols
spark_df = spark.createDataFrame(df_pd)
spark_df.createOrReplaceTempView('df_sql')
df_out = sqlContext.sql("""
select *, (sum(ifnull(Revenue,0)) over (partition by Client
order by Client,Month
rows between 3 preceding and 1 preceding)) as Total_Sum3
from df_sql
""")
df_out.show()
+------+------+-------+----------+
|Client| Month|Revenue|Total_Sum3|
+------+------+-------+----------+
| A|201701| 100.0| null|
| A|201702| 101.0| 100.0|
| A|201703| 102.0| 201.0|
| A|201704| 103.0| 303.0|
| A|201705| 104.0| 306.0|
| B|201701| 201.0| null|
| B|201702| NaN| 201.0|
| B|201703| 203.0| NaN|
| B|201704| 204.0| NaN|
| B|201705| 205.0| NaN|
| B|201706| 206.0| 612.0|
| B|201707| 207.0| 615.0|
+------+------+-------+----------+
As you can see, if a null value exists anywhere in the 3-month window, a null is returned. I want to treat nulls as 0, hence the ifnull attempt, but that doesn't seem to work. I also tried a case statement to change NULL to 0, without success.
Just coalesce outside the sum:
df_out = sqlContext.sql("""
select *, coalesce(sum(Revenue) over (partition by Client
order by Client,Month
rows between 3 preceding and 1 preceding), 0) as Total_Sum3
from df_sql
""")
It's Apache Spark, my mistake! (I'm working in Databricks and assumed MySQL was the engine under the hood.) Is it too late to change the title?
@Barmar, you're right, IFNULL() doesn't treat NaN as null. Thanks to @user6910411, I found the fix from here: SO link. I had to convert the numpy NaNs into actual nulls. Correct code, after creating the sample df_pd:
spark_df = spark.createDataFrame(df_pd)

from pyspark.sql.functions import isnan, col, when

# this converts all NaNs in numeric (double/float) columns to null:
spark_df = spark_df.select([
    when(~isnan(c), col(c)).alias(c) if t in ("double", "float") else c
    for c, t in spark_df.dtypes])
spark_df.createOrReplaceTempView('df_sql')
df_out = sqlContext.sql("""
select *, (sum(ifnull(Revenue,0)) over (partition by Client
order by Client,Month
rows between 3 preceding and 1 preceding)) as Total_Sum3
from df_sql order by Client,Month
""")
df_out.show()
which then gives the desired output:
+------+------+-------+----------+
|Client| Month|Revenue|Total_Sum3|
+------+------+-------+----------+
| A|201701| 100.0| null|
| A|201702| 101.0| 100.0|
| A|201703| 102.0| 201.0|
| A|201704| 103.0| 303.0|
| A|201705| 104.0| 306.0|
| B|201701| 201.0| null|
| B|201702| null| 201.0|
| B|201703| 203.0| 201.0|
| B|201704| 204.0| 404.0|
| B|201705| 205.0| 407.0|
| B|201706| 206.0| 612.0|
| B|201707| 207.0| 615.0|
+------+------+-------+----------+
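As a possible shortcut (a sketch I haven't run here, assuming the raw spark_df straight from pandas, where the missing value is a NaN rather than a null): Spark SQL's nanvl(expr1, expr2) returns expr2 when expr1 is NaN, so the NaN could arguably be zeroed out inside the query itself, skipping the isnan/when conversion step:

df_out = sqlContext.sql("""
select *, sum(nanvl(Revenue, 0)) over (partition by Client
order by Client,Month
rows between 3 preceding and 1 preceding) as Total_Sum3
from df_sql order by Client,Month
""")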
Is sqlContext the best way to handle this, or would it be better/more elegant to achieve the same result via pyspark.sql.window?
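For reference, a minimal sketch of the same rolling sum through the DataFrame/Window API instead of a temp view (column names as above, and assuming spark_df already holds real nulls rather than NaNs):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# frame: from 3 rows before the current row up to 1 row before it
w = Window.partitionBy('Client').orderBy('Month').rowsBetween(-3, -1)

df_out = (spark_df
          .withColumn('Total_Sum3',
                      F.sum(F.coalesce(F.col('Revenue'), F.lit(0))).over(w))
          .orderBy('Client', 'Month'))
df_out.show()

As with the SQL version, the coalesce is mostly belt-and-braces: sum already skips individual nulls, so it only changes the result when every value in the frame is null.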