SQL

Question

I'm using SQL with pyspark and hive, and I'm new to it. I have a Hive table with a string-type column, like this:

id | values
1  | '2;4;4'
2  |  '5;1'
3  |  '8;0;4'

I would like to write a query that returns this:

id | values | sum
1  | '2.2;4;4'  | 10.2
2  |  '5;1.2' |  6.2
3  |  '8;0;4' | 12

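By using split(values, ';') I can get an array like ['2.2','4','4'], but I still need to convert the elements to decimal numbers and sum them. Is there a not-too-complicated way to do this?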

Thank you very much! Happy coding, everyone :)

Answer 1

Write a stored function to do the work:

CREATE FUNCTION SPLIT_AND_SUM ( s VARCHAR(1024) ) RETURNS INT
BEGIN
   ...
END

Answer 2

PySpark solution

from pyspark.sql.functions import udf, col, split
from pyspark.sql.types import FloatType
# UDF that sums the split values; returns None if the string contains a non-numeric value.
# Change the implementation of the function as needed.
def values_sum(split_list):
    total = 0
    for num in split_list:
        try:
            total += float(num)
        except ValueError:
            return None
    return total

values_summed = udf(values_sum, FloatType())
res = df.withColumn('summed', values_summed(split(col('values'), ';')))
res.show()

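If the array values are known to be of a given data type, the solution could be a one-liner. However, it is better to use a safer implementation that covers all cases.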
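
As a rough illustration of the "safer implementation" point, the helper's logic can be exercised in plain Python before it is wrapped in a UDF. The sketch below is a variant, not the answer's exact code: the name `safe_values_sum` is hypothetical, and returning None for a missing (None) input is an added assumption about how a NULL column value would arrive.

```python
from typing import Optional, Sequence


def safe_values_sum(split_list: Optional[Sequence[str]]) -> Optional[float]:
    """Sum numeric strings; return None if the input is missing or any item is not numeric."""
    # Assumption: a NULL column value reaches the function as None.
    if split_list is None:
        return None
    total = 0.0
    for num in split_list:
        try:
            total += float(num)
        except (TypeError, ValueError):
            # Non-numeric item (e.g. 'abc', '' or None): give up on the whole row.
            return None
    return total


# Rows in the style of the question, already split on ';'.
print(safe_values_sum(["2.2", "4", "4"]))  # approximately 10.2
print(safe_values_sum(["5", "x"]))         # None
print(safe_values_sum(None))               # None
```

A function like this could presumably be registered the same way as `values_sum` in the answer.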

Hive solution

Use explode with split, then group by to sum the values.

select id,sum(cast(split_value as float)) as summed
from tbl
lateral view explode(split(values,';')) t as split_value
group by id
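
The query's logic can be mimicked in plain Python to make the steps visible: split each string, explode it into (id, value) pairs, cast, and sum per id. This is only a sketch of the semantics, not Hive code; the helper names `to_float_or_none` and `explode_and_sum` are hypothetical. It assumes that a failed cast yields NULL and that sum skips NULLs, so a non-numeric piece is ignored here rather than turning the whole row into None as the UDF in this answer does.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def to_float_or_none(text: str) -> Optional[float]:
    """Mimic a cast that yields NULL (None) instead of failing."""
    try:
        return float(text)
    except ValueError:
        return None


def explode_and_sum(rows: Iterable[Tuple[int, str]]) -> Dict[int, Optional[float]]:
    """Emulate: lateral view explode(split(values, ';')) ... group by id."""
    groups = defaultdict(list)
    for row_id, values in rows:
        for piece in values.split(";"):  # split + explode
            groups[row_id].append(to_float_or_none(piece))
    result = {}
    for row_id, casted in groups.items():
        kept = [v for v in casted if v is not None]   # sum() skips NULLs
        result[row_id] = sum(kept) if kept else None  # all-NULL group -> NULL
    return result


rows = [(1, "2;4;4"), (2, "5;1"), (3, "8;0;4")]
print(explode_and_sum(rows))  # {1: 10.0, 2: 6.0, 3: 12.0}
```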

Answer 3

From Spark 2.4+

Instead of using explode on the array, we can apply higher-order functions directly to the array.

Example:

from pyspark.sql.functions import col, split

df = spark.createDataFrame([("1","2;4;4"),("2","5;1"),("3","8;0;4")], ["id","values"])

# Split the string and cast to an array<int> column (note: decimal values such as '2.2' would not be preserved by this cast)
df1 = df.withColumn("arr", split(col("values"), ";").cast("array<int>"))

df1.createOrReplaceTempView("tmp")

spark.sql("select *,aggregate(arr,0,(x,y) -> x + y) as sum from tmp").drop("arr").show()
#+---+------+---+
#| id|values|sum|
#+---+------+---+
#|  1| 2;4;4| 10|
#|  2|   5;1|  6|
#|  3| 8;0;4| 12|
#+---+------+---+

# The same with the DataFrame API

df1.selectExpr("*","aggregate(arr,0,(x,y) -> x + y) as sum").drop("arr").show()
#+---+------+---+
#| id|values|sum|
#+---+------+---+
#|  1| 2;4;4| 10|
#|  2|   5;1|  6|
#|  3| 8;0;4| 12|
#+---+------+---+
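
The example above casts to `array<int>`, whereas the question's expected output contains decimals. As a language-neutral sketch of what `aggregate(arr, start, (x, y) -> x + y)` computes, the same left fold can be written with `functools.reduce`; the name `fold_sum` is hypothetical. In Spark itself, the likely adaptation is to cast to `array<double>` and use a start value of matching type, such as `cast(0 as double)`, but that is stated here as an assumption rather than shown running.

```python
from functools import reduce
from typing import List


def fold_sum(arr: List[float], start: float = 0.0) -> float:
    """Left fold: begin with `start`, then apply acc + item for each element."""
    return reduce(lambda acc, item: acc + item, arr, start)


# Values from the question's expected output, split on ';' and cast to float.
for values in ["2.2;4;4", "5;1.2", "8;0;4"]:
    arr = [float(piece) for piece in values.split(";")]
    print(values, round(fold_sum(arr), 2))
```

The start value plays the role of the accumulator's initial state, which is why its type has to agree with the element type once the array holds non-integers.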

SQL - How can I sum elements of an array?

csv

hive

pyspark

pyspark-sql