Spark higher-order function transform: output a struct
How can I use the Spark higher-order function transform to turn an array of structs into an array of structs again?
Dataset:
case class Foo(thing1:String, thing2:String, thing3:String)
case class Baz(foo:Foo, other:String)
case class Bar(id:Int, bazes:Seq[Baz])
import spark.implicits._
val df = Seq(Bar(1, Seq(Baz(Foo("first", "second", "third"), "other"), Baz(Foo("1", "2", "3"), "else")))).toDF
df.printSchema
df.show(false)
I want to concatenate thing1, thing2, and thing3 for each bar, but keep the other property.
A simple identity transform:
scala> df.withColumn("cleaned", expr("transform(bazes, x -> x)")).printSchema
root
|-- id: integer (nullable = false)
|-- bazes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- foo: struct (nullable = true)
| | | |-- thing1: string (nullable = true)
| | | |-- thing2: string (nullable = true)
| | | |-- thing3: string (nullable = true)
| | |-- other: string (nullable = true)
|-- cleaned: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- foo: struct (nullable = true)
| | | |-- thing1: string (nullable = true)
| | | |-- thing2: string (nullable = true)
| | | |-- thing3: string (nullable = true)
| | |-- other: string (nullable = true)
only copies everything over.
The desired concat operation:
df.withColumn("cleaned", expr("transform(bazes, x -> concat(x.foo.thing1, '::', x.foo.thing2, '::', x.foo.thing3))")).printSchema
unfortunately drops all the values of the other column:
+---+----------------------------------------------------+-------------------------------+
|id |bazes |cleaned |
+---+----------------------------------------------------+-------------------------------+
|1 |[[[first, second, third], other], [[1, 2, 3], else]]|[first::second::third, 1::2::3]|
+---+----------------------------------------------------+-------------------------------+
How can these be retained?
Trying to keep a tuple:
df.withColumn("cleaned", expr("transform(bazes, x -> (concat(x.foo.thing1, '::', x.foo.thing2, '::', x.foo.thing3), x.other))")).printSchema
fails with:
.AnalysisException: cannot resolve 'named_struct('col1', concat(namedlambdavariable().`foo`.`thing1`, '::', namedlambdavariable().`foo`.`thing2`, '::', namedlambdavariable().`foo`.`thing3`), NamePlaceholder(), namedlambdavariable().`other`)' due to data type mismatch: Only foldable string expressions are allowed to appear at odd position, got: NamePlaceholder; line 1 pos 22;
Edit
Desired output:
A new column with the following contents:
[[first::second::third, other], [1::2::3, else]]
so that the other column is retained.
You can get the desired output this way. foo and other sit at the same level inside each element of bazes, so other is not reachable through foo; reference it separately as x.other and place it next to the concatenated string inside struct(...).
scala> df.withColumn("cleaned", expr("transform(bazes, x -> struct(concat(x.foo.thing1, '::', x.foo.thing2, '::', x.foo.thing3),cast(x.other as string)))")).show(false)
+---+----------------------------------------------------+------------------------------------------------+
|id |bazes |cleaned |
+---+----------------------------------------------------+------------------------------------------------+
printSchema
scala> df.withColumn("cleaned", expr("transform(bazes, x -> struct(concat(x.foo.thing1, '::', x.foo.thing2, '::', x.foo.thing3),cast(x.other as string)))")).printSchema
root
|-- id: integer (nullable = false)
|-- bazes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- foo: struct (nullable = true)
| | | |-- thing1: string (nullable = true)
| | | |-- thing2: string (nullable = true)
| | | |-- thing3: string (nullable = true)
| | |-- other: string (nullable = true)
|-- cleaned: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- col1: string (nullable = true)
| | |-- col2: string (nullable = true)
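The col1 and col2 names in the schema above are generated by Spark. If explicit field names are preferred, one option is the built-in SQL function named_struct. The following is a minimal sketch, assuming the same data as in the question; the field name things is made up for illustration, and the local SparkSession setup is only there to make the snippet self-contained (in spark-shell, spark already exists).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Same case classes as in the question.
case class Foo(thing1: String, thing2: String, thing3: String)
case class Baz(foo: Foo, other: String)
case class Bar(id: Int, bazes: Seq[Baz])

// Assumption: a local session for illustration; getOrCreate reuses an existing one.
val spark = SparkSession.builder().master("local[*]").appName("named-struct-sketch").getOrCreate()
import spark.implicits._

val df = Seq(Bar(1, Seq(Baz(Foo("first", "second", "third"), "other"), Baz(Foo("1", "2", "3"), "else")))).toDF

// named_struct takes alternating name/value arguments.
// 'things' is a hypothetical field name; 'other' keeps the original name.
val cleaned = df.withColumn("cleaned", expr(
  "transform(bazes, x -> named_struct(" +
    "'things', concat(x.foo.thing1, '::', x.foo.thing2, '::', x.foo.thing3), " +
    "'other', x.other))"))

cleaned.printSchema()
cleaned.select("cleaned").show(false)
```

With named fields, downstream code can refer to cleaned.things and cleaned.other instead of positional col1 and col2.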
If you have any further questions about this, let me know.
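For readers on Spark 3.0 or later, the same reshaping can probably be written without a SQL string, since org.apache.spark.sql.functions gained a transform function that takes a Column => Column lambda. This is a sketch under that version assumption, not something from the original answer; the field name things is again hypothetical. It also swaps concat for concat_ws, which skips null inputs instead of returning null, so the two are only equivalent when none of the thing fields is null.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws, struct, transform}

// Same case classes as in the question.
case class Foo(thing1: String, thing2: String, thing3: String)
case class Baz(foo: Foo, other: String)
case class Bar(id: Int, bazes: Seq[Baz])

// Assumption: Spark 3.0+ and a local session for illustration.
val spark = SparkSession.builder().master("local[*]").appName("column-api-sketch").getOrCreate()
import spark.implicits._

val df = Seq(Bar(1, Seq(Baz(Foo("first", "second", "third"), "other"), Baz(Foo("1", "2", "3"), "else")))).toDF

// x is a Column standing for one element of bazes.
val cleaned = df.withColumn("cleaned",
  transform(col("bazes"), x => {
    val foo = x.getField("foo")
    struct(
      // 'things' is a hypothetical field name for the joined string.
      concat_ws("::", foo.getField("thing1"), foo.getField("thing2"), foo.getField("thing3")).as("things"),
      x.getField("other").as("other"))
  }))

cleaned.printSchema()
cleaned.select("cleaned").show(false)
```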