How to extract a particular element from the column for each row?

Question

I have the following DataFrame in Spark 2.2.0 and Scala 2.11.8.

+----------+-------------------------------+
|item      |        other_items            |
+----------+-------------------------------+
|  111     |[[444,1.0],[333,0.5],[666,0.4]]|
|  222     |[[444,1.0],[333,0.5]]          |
|  333     |[]                             |
|  444     |[[111,2.0],[555,0.5],[777,0.2]]|
+----------+-------------------------------+

I want to get the following DataFrame:

+----------+-------------+
|item      | other_items |
+----------+-------------+
|  111     | 444         |
|  222     | 444         |
|  444     | 111         |
+----------+-------------+

So, basically, I need to extract the first item from other_items for each row. I also need to ignore the rows whose other_items is an empty array [].

How can I do it?

I tried this approach, but it does not give me the expected result.

val result = df.withColumn("other_items", $"other_items"(0))

printSchema gives the following output:

 |-- item: string (nullable = true)
 |-- other_items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _1: string (nullable = true)
 |    |    |-- _2: double (nullable = true)

Answer 1

Like this:

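// Needed for toDF and the $"..." syntax (already in scope in spark-shell)
import spark.implicits._

val df = Seq(
  ("111", Seq(("111", 1.0), ("333", 0.5), ("666", 0.4))), ("333", Seq())
).toDF("item", "other_items")

// Take the first struct of each array, then its _1 field;
// empty arrays yield null, which na.drop removes
df.select($"item", $"other_items"(0)("_1").alias("other_items"))
  .na.drop(Seq("other_items")).show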

Here the first apply ($"other_items"(0)) selects the first element of the array, the second apply (_("_1")) selects the _1 field, and na.drop removes the nulls introduced by empty arrays.

+----+-----------+
|item|other_items|
+----+-----------+
| 111|        111|
+----+-----------+
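
For comparison, the same steps can be written with the named Column methods getItem and getField, and the empty arrays can be filtered out up front with size instead of being dropped as nulls afterwards. This is only a sketch under a few assumptions: the object name FirstOtherItem and the helper firstOtherItem are hypothetical, a local SparkSession is created just for illustration, and the input reuses the data from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, size}

object FirstOtherItem {
  // Hypothetical helper (not part of the original answer): keep only rows whose
  // array is non-empty, then take the _1 field of the first struct in the array.
  def firstOtherItem(df: DataFrame): DataFrame =
    df.filter(size(col("other_items")) > 0)
      .select(
        col("item"),
        col("other_items").getItem(0).getField("_1").alias("other_items"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for illustration; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("first-other-item")
      .getOrCreate()
    import spark.implicits._

    // Same data as in the question; tuples become structs with fields _1 and _2.
    val df = Seq(
      ("111", Seq(("444", 1.0), ("333", 0.5), ("666", 0.4))),
      ("222", Seq(("444", 1.0), ("333", 0.5))),
      ("333", Seq.empty[(String, Double)]),
      ("444", Seq(("111", 2.0), ("555", 0.5), ("777", 0.2)))
    ).toDF("item", "other_items")

    firstOtherItem(df).show()
    spark.stop()
  }
}
```

With the question's data this should leave the rows for items 111, 222 and 444 and skip item 333. Filtering on size first also avoids dropping rows for an unrelated reason, since na.drop removes any null in the selected column, whether it came from an empty array or from a null _1 field.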

scala

apache-spark

spark-dataframe