Spark SQL: Select with arithmetic on column values and type casting?
I'm using Spark SQL with DataFrames. Is there a way to do a select statement with some arithmetic, just as you can in SQL?
For example, I have the following table:
var data = Array((1, "foo", 30, 5), (2, "bar", 35, 3), (3, "foo", 25, 4))
var dataDf = sc.parallelize(data).toDF("id", "name", "value", "years")
dataDf.printSchema
// root
// |-- id: integer (nullable = false)
// |-- name: string (nullable = true)
// |-- value: integer (nullable = false)
// |-- years: integer (nullable = false)
dataDf.show()
// +---+----+-----+-----+
// | id|name|value|years|
// +---+----+-----+-----+
// | 1| foo| 30| 5|
// | 2| bar| 35| 3|
// | 3| foo| 25| 4|
// +---+----+-----+-----+
Now I want to perform a SELECT statement that creates a new column with some arithmetic performed on the existing columns. For example, I would like to compute the ratio value/years. I need to convert value (or years) to a double first. I tried this statement, but it won't parse:
dataDf.
select(dataDf("name"), (dataDf("value").toDouble/dataDf("years")).as("ratio")).
show()
<console>:35: error: value toDouble is not a member of org.apache.spark.sql.Column
select(dataDf("name"), (dataDf("value").toDouble/dataDf("years")).as("ratio")).
I saw a similar question in "How to change column types in Spark SQL's DataFrame?", but that's not what I'm after.
The correct way to change the type of a Column is to use the cast method. It can take either a description string:
dataDf("value").cast("double") / dataDf("years")
or a DataType:
import org.apache.spark.sql.types.DoubleType
dataDf("value").cast(DoubleType) / dataDf("years")
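With cast in place, the statement from the question parses and runs. This is the original select with toDouble replaced by cast:

```scala
dataDf.
  select(dataDf("name"), (dataDf("value").cast("double") / dataDf("years")).as("ratio")).
  show()
// +----+------------------+
// |name|             ratio|
// +----+------------------+
// | foo|               6.0|
// | bar|11.666666666666666|
// | foo|              6.25|
// +----+------------------+
```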
Well, if you don't need to use the select method, you can just use withColumn:
import org.apache.spark.sql.functions.col

val resDF = dataDf.withColumn("result", col("value").cast("double") / col("years"))
resDF.show
//+---+----+-----+-----+------------------+
//| id|name|value|years| result|
//+---+----+-----+-----+------------------+
//| 1| foo| 30| 5| 6.0|
//| 2| bar| 35| 3|11.666666666666666|
//| 3| foo| 25| 4| 6.25|
//+---+----+-----+-----+------------------+
If you do need to use select, one option could be:
val exprs = dataDf.columns.map(col(_)) ++ List((col("value").cast("double") / col("years")).as("result"))
dataDf.select(exprs: _*).show
//+---+----+-----+-----+------------------+
//| id|name|value|years| result|
//+---+----+-----+-----+------------------+
//| 1| foo| 30| 5| 6.0|
//| 2| bar| 35| 3|11.666666666666666|
//| 3| foo| 25| 4| 6.25|
//+---+----+-----+-----+------------------+
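Two more ways to keep every column while adding the computed one in a single select: the star column, or a plain SQL expression via selectExpr. Both give the same output as above:

```scala
import org.apache.spark.sql.functions.col

// col("*") keeps all existing columns, then the computed column is appended
dataDf.select(col("*"), (col("value").cast("double") / col("years")).as("result")).show

// selectExpr accepts SQL expression strings, so the cast can be written in SQL syntax
dataDf.selectExpr("*", "CAST(value AS DOUBLE) / years AS result").show
```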