Spark: programmatically populate columns from array values
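I have a column that holds a list of identifiers (runways, in this case). It could be an array or a comma-separated list; in this example I convert it to an array. I'm trying to figure out the idiomatic/programmatic way to update a set of columns based on the contents of that array.
Working example using the anti-pattern: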
// Assumes a SparkSession named `spark` is in scope (as in spark-shell).
import org.apache.spark.sql.functions.{array_contains, split, when}
import spark.implicits._

val data = Seq("08L,08R,09")
val df = data.toDF("runways")
  .withColumn("runway_set", split('runways, ","))
  .withColumn("runway_in_use_08L", when(array_contains('runway_set, "08L"), 1).otherwise(0))
  .withColumn("runway_in_use_26R", when(array_contains('runway_set, "26R"), 1).otherwise(0))
  .withColumn("runway_in_use_08R", when(array_contains('runway_set, "08R"), 1).otherwise(0))
  .withColumn("runway_in_use_26L", when(array_contains('runway_set, "26L"), 1).otherwise(0))
  .withColumn("runway_in_use_09", when(array_contains('runway_set, "09"), 1).otherwise(0))
  .withColumn("runway_in_use_27", when(array_contains('runway_set, "27"), 1).otherwise(0))
  .withColumn("runway_in_use_15L", when(array_contains('runway_set, "15L"), 1).otherwise(0))
  .withColumn("runway_in_use_33R", when(array_contains('runway_set, "33R"), 1).otherwise(0))
  .withColumn("runway_in_use_15R", when(array_contains('runway_set, "15R"), 1).otherwise(0))
  .withColumn("runway_in_use_33L", when(array_contains('runway_set, "33L"), 1).otherwise(0))
This essentially produces one-hot encoded values, like so:
+----------+--------------+-----------------+-----------------+-----------------+-----------------+----------------+----------------+-----------------+-----------------+-----------------+-----------------+
|   runways|    runway_set|runway_in_use_08L|runway_in_use_26R|runway_in_use_08R|runway_in_use_26L|runway_in_use_09|runway_in_use_27|runway_in_use_15L|runway_in_use_33R|runway_in_use_15R|runway_in_use_33L|
+----------+--------------+-----------------+-----------------+-----------------+-----------------+----------------+----------------+-----------------+-----------------+-----------------+-----------------+
|08L,08R,09|[08L, 08R, 09]|                1|                0|                1|                0|               1|               0|                0|                0|                0|                0|
+----------+--------------+-----------------+-----------------+-----------------+-----------------+----------------+----------------+-----------------+-----------------+-----------------+-----------------+
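It feels like I should be able to take a static sequence of all the identifiers and do all of the above programmatically in a loop/map/foreach-style expression, but I'm not sure how to express it.
For example: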
val all_runways = Seq("08L","26R","08R","26L","09","27","15L","33R","15R","33L")
// iterate through and update each column, e.g. 'runway_in_use_$i'
Any pointers? Thanks in advance.
This is a typical use case for a fold.
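val df = data.toDF("runways")
  .withColumn("runway_set", split('runways, ","))

// Start from df and add one 0/1 indicator column per runway identifier;
// acc is the DataFrame built so far, x is the current identifier.
val df2 = all_runways.foldLeft(df) { (acc, x) =>
  acc.withColumn(s"runway_in_use_$x", when(array_contains('runway_set, x), 1).otherwise(0))
}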
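If the list of identifiers grows large, a possible alternative to the fold is to build all the indicator columns up front and add them in a single select. The Spark API docs for withColumn note that each call introduces a projection and that calling it many times in a loop can produce large plans; for the ten runways here the fold is fine. The following is only a sketch: the names RunwayFlags, flagColumns and withRunwayFlags are made up for illustration, and it assumes a local SparkSession and a column named runway_set as in the question.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array_contains, col, split, when}

// Hypothetical helper object, not part of Spark.
object RunwayFlags {
  val allRunways: Seq[String] =
    Seq("08L", "26R", "08R", "26L", "09", "27", "15L", "33R", "15R", "33L")

  // One 0/1 indicator column per identifier, named runway_in_use_<id>.
  // Assumes the input DataFrame has an array column called "runway_set".
  def flagColumns(ids: Seq[String]): Seq[Column] =
    ids.map { id =>
      when(array_contains(col("runway_set"), id), 1)
        .otherwise(0)
        .as(s"runway_in_use_$id")
    }

  // Keep every existing column and append all indicators in one projection.
  def withRunwayFlags(df: DataFrame, ids: Seq[String]): DataFrame =
    df.select(col("*") +: flagColumns(ids): _*)

  def main(args: Array[String]): Unit = {
    // Local session for demonstration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("runway-flags")
      .getOrCreate()
    import spark.implicits._

    val df = Seq("08L,08R,09").toDF("runways")
      .withColumn("runway_set", split(col("runways"), ","))

    withRunwayFlags(df, allRunways).show(false)
    spark.stop()
  }
}
```

The resulting columns should match what the fold produces; the difference is only in how the plan is built.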