How to count the elements in a column of arrays?
I am trying to count the elements in the FavouriteCities column of the following DataFrame.
+-----------------+
| FavouriteCities |
+-----------------+
| [NY, Canada] |
+-----------------+
The schema is as follows:
scala> data.printSchema
root
|-- FavouriteCities: array (nullable = true)
| |-- element: string (containsNull = true)
The expected output should look something like this:
+------------+-------------+
| City | Count |
+------------+-------------+
| NY | 1 |
| Canada | 1 |
+------------+-------------+
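I tried using agg() and count() as shown below, but this fails to extract the individual elements from the array and instead tries to find the most common set of elements in the column.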
data.agg(count("FavouriteCities").alias("count"))
Can someone guide me?
To match the schema you have shown:
scala> val data = Seq(Tuple1(Array("NY", "Canada"))).toDF("FavouriteCities")
data: org.apache.spark.sql.DataFrame = [FavouriteCities: array<string>]
scala> data.printSchema
root
|-- FavouriteCities: array (nullable = true)
| |-- element: string (containsNull = true)
Explode:
val counts = data
  .select(explode($"FavouriteCities") as "City")
  .groupBy("City")
  .count
Aggregate:
import spark.implicits._
scala> counts.as[(String, Long)].reduce((a, b) => if (a._2 > b._2) a else b)
res3: (String, Long) = (Canada,1)
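If the goal is the full per-city table from the question rather than only the single most common city, the explode-then-group step can be sketched as a standalone program. This is a minimal sketch, not a definitive implementation: the object name CityCounts, the helper cityCounts, the local[*] master, and the extra sample row with "London" are all assumptions made for illustration and do not come from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, desc, explode}

// Hypothetical wrapper object, named here only for illustration.
object CityCounts {

  // Turns each array element into its own row, then counts rows per city.
  def cityCounts(data: DataFrame): DataFrame =
    data
      .select(explode(col("FavouriteCities")).as("City"))
      .groupBy("City")
      .count()
      .withColumnRenamed("count", "Count")
      // Most frequent first; ties are ordered by city name so the result is stable.
      .orderBy(desc("Count"), col("City"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("CityCounts")
      .getOrCreate()
    import spark.implicits._

    // Assumed sample data: the row from the question plus one extra row.
    val data = Seq(
      Tuple1(Array("NY", "Canada")),
      Tuple1(Array("NY", "London"))
    ).toDF("FavouriteCities")

    cityCounts(data).show()
    spark.stop()
  }
}
```

Note that with the single row from the question every city has a count of 1, so the reduce above breaks the tie arbitrarily. Sorting on a second column, as in this sketch, is one way to make the order deterministic.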