Spark: Combine columns in dataframe with a character between them
I want to combine a number of columns in a Spark dataframe into a single column, with a separator character between each of them. I don't want to combine all of the columns this way, only some of them. In this example, I want a pipe between all of the values except those in the first two columns.
Here is a sample input:
+---+--------+----------+----------+---------+
|id | detail | context  | col3     | col4    |
+---+--------+----------+----------+---------+
| 1 | {blah} | service  | null     | null    |
| 2 | { blah | """ blah | """blah} | service |
| 3 | { blah | """blah} | service  | null    |
+---+--------+----------+----------+---------+
The expected output looks like this:
+---+--------+----------+----------+---------+--------------------------+
|id | detail | context  | col3     | col4    | data                     |
+---+--------+----------+----------+---------+--------------------------+
| 1 | {blah} | service  | null     | null    | service||                |
| 2 | { blah | """ blah | """blah} | service | """blah|"""blah}|service |
| 3 | { blah | """blah} | service  | null    | """blah}|service|        |
+---+--------+----------+----------+---------+--------------------------+
Currently, I have something like this:
val columns = df.columns.filterNot(_ == "id").filterNot(_ == "detail")
val nonulls = df.na.fill("")
val combined = nonulls.select($"id", concat(columns.map(col): _*) as "data")
The above combines the columns, but without adding the separator character between them. I've tried these variations, but I'm clearly not getting it right:
scala> val combined = nonulls.select($"id", concat(columns.map(col):_|*) as "data")
scala> val combined = nonulls.select($"id", concat(columns.map(col):_*, lit('|')) as "data")
scala> val combined = nonulls.select($"id", concat(columns.map(col):_*|) as "data")
Any suggestions would be much appreciated! :) Thanks!
This should do the trick:
val columns = df.columns.filterNot(_ == "id").filterNot(_ == "detail")
// Interleave a "|" literal after every column, then drop the trailing separator
val columnsWithPipe = columns.flatMap(colname => Seq(col(colname), lit("|"))).dropRight(1)
val combined = nonulls.select($"id", concat(columnsWithPipe: _*) as "data")
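For reference, a minimal end-to-end sketch of that approach; the session setup and sample data here are illustrative, rebuilt from the input table above:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, lit}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "{blah}", "service", null, null),
  (2, "{ blah", "\"\"\" blah", "\"\"\"blah}", "service"),
  (3, "{ blah", "\"\"\"blah}", "service", null)
).toDF("id", "detail", "context", "col3", "col4")

val nonulls = df.na.fill("")  // replace nulls with "" so concat doesn't null out the whole row
val columns = df.columns.filterNot(_ == "id").filterNot(_ == "detail")
val columnsWithPipe = columns.flatMap(colname => Seq(col(colname), lit("|"))).dropRight(1)
nonulls.select($"id", concat(columnsWithPipe: _*) as "data").show(false)
// id 1 yields "service||" -- the empty col3/col4 still get their separators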
Just use the concat_ws function... it will concatenate the columns with whatever separator you choose.
Import it with:
import org.apache.spark.sql.functions.concat_ws
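A minimal sketch of how that looks here, reusing df and the column selection from the question. Note that concat_ws skips null values entirely, so fill them with "" first if you want to keep the empty slots between separators, as in the expected output above:

import org.apache.spark.sql.functions.{col, concat_ws}

val columns = df.columns.filterNot(_ == "id").filterNot(_ == "detail")
// concat_ws("|", ...) joins the given columns with a pipe between each value
val combined = df.na.fill("").select(col("id"), concat_ws("|", columns.map(col): _*) as "data")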