How to get the whole row's size in a df using Scala

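The DataFrame has many columns. I need to add a new column holding the size of the whole row, which means adding up the sizes of all the columns. Is there a simple and efficient way to do this? Thanks.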

Here is an example:

val DataFrame = Seq(("Alice", "He is girl"), ("Bob", "She is girl"), ("Ben", null)).toDF("name","string") 
display(DataFrame) 

I want to add a column to the df that sums the length of every column. This example has only two columns, but my real df has a hundred.

val df = Seq(("Alice", "He is girl"), 
   ("Bob", "She is girl"), ("Ben", null)).toDF("name","string")

scala> df.show
+-----+-----------+
| name|     string|
+-----+-----------+
|Alice| He is girl|
|  Bob|She is girl|
|  Ben|       null|
+-----+-----------+

Replace the nulls with empty strings, so that length returns 0 instead of null:

val dfNoNull = df.na.fill("")

scala> dfNoNull.show
+-----+-----------+
| name|     string|
+-----+-----------+
|Alice| He is girl|
|  Bob|She is girl|
|  Ben|           |
+-----+-----------+

Build a list of column expressions by applying the length function to each column:

import org.apache.spark.sql.functions.{col, length}  // needed when not running in spark-shell
val cols = dfNoNull.columns.map(x => length(col(x)))

Select the data using these column expressions:

val dfColCounts = dfNoNull.select(cols:_*)

scala> dfColCounts.show
+------------+--------------+
|length(name)|length(string)|
+------------+--------------+
|           5|            10|
|           3|            11|
|           3|             0|
+------------+--------------+

Turn the new column names into Column objects:

val countCols = dfColCounts.columns.map(x => col(x))

Apply reduce to sum the values of all the columns, which are now integers:

val dfPerRowCounts = dfColCounts
   .withColumn("countPerRow", countCols.reduce(_ + _))
   .select("countPerRow")

Result:

scala> dfPerRowCounts.show
+-----------+
|countPerRow|
+-----------+
|         15|
|         14|
|          3|
+-----------+
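
The steps above leave only the count column, while the question asks for a new column on the original df. A minimal sketch of one way to fold everything into a single pass is below. The object name RowSizeSketch, the helper withRowSize and the output column name rowSize are made up for illustration. It assumes the df has at least one column, that the columns are simple (non-nested) types, and that no column name contains a dot. Nulls are counted as 0 with coalesce instead of na.fill, and non-string columns are cast to string before measuring.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, length, lit}

object RowSizeSketch {
  // Hypothetical helper: appends a column with the summed character length
  // of every column in the row. Null values contribute 0.
  def withRowSize(df: DataFrame, outCol: String = "rowSize"): DataFrame = {
    // One length expression per column; cast so non-string columns work too.
    val perColumn = df.columns.map { c =>
      coalesce(length(col(c).cast("string")), lit(0))
    }
    // reduce chains the expressions into a single sum (needs >= 1 column).
    df.withColumn(outCol, perColumn.reduce(_ + _))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("row-size-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("Alice", "He is girl"),
      ("Bob", "She is girl"), ("Ben", null)).toDF("name", "string")

    // Keeps name and string, and adds rowSize next to them.
    withRowSize(df).show()

    spark.stop()
  }
}
```

For the sample data the rowSize values are 15, 14 and 3, the same as countPerRow above. Because the sum is one expression added with a single withColumn, the plan stays small even with a hundred columns.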