Select keys and values only if there are more than 5 values that are more than 0

Question

Below is the data I get after grouping and filtering a Spark DataFrame in Scala:

+---------------+------+--------+-------+------+------+------+------+------+--------+
|         keys  |num_1 |num_2   |num_3  |num_4 |num_5 |num_6 |num_7 |num_8 |num_9   |
+---------------+------+--------+-------+------+------+------+------+------+--------+
|              1|     0|       0|      0|     0|     0|     0|     0|     0|       0|
|              2|     0|       0|      0|     0|     0|     0|     0|     0|       0|
|              3|     0|     134|      0|     0|    44|   332|     0|   423|     111|
|              4|     0|     338|      0|     0|     0|     0|     0|     0|       0|
|              5|     0|       0|      0|     0|     0|     0|     0|     0|       0|
|              6|     0|       0|      0|     0|     0|     0|     0|     0|       0|
|              7|     0|     130|      4|    11|     0|     5|  1222|     0|       0|
|              8|     0|       1|      0|     0|     0|     0|     0|     0|       2|
+---------------+------+--------+-------+------+------+------+------+------+--------+

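Is there a simple way to select, from this filtered data, only the keys that have more than 5 values greater than 0?

(For example, of the eight keys, only keys 3 and 7 would be selected, along with their values.)

The only approach I can think of is to check each value (num_1, num_2, ..., num_9) individually and increment a variable (say 'i') whenever the value is greater than 0. If the variable is greater than 5 once all the checks are done, select the key along with its values. This seems verbose, though.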

Answer 1

Build the filter condition as follows:

import org.apache.spark.sql.functions.{col, when}
df.columns.tail.map(x => when(col(x) > 0, 1).otherwise(0)).reduce(_ + _) >= 5

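This maps every value greater than 0 to 1 and everything else to 0, then uses reduce to add the results across all columns, which counts the positive values in each row. df.columns.tail skips the first column (keys). The threshold is >= 5 rather than > 5 because keys 3 and 7 each have exactly five positive values, which matches the expected result in the question.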
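To make the counting step concrete, here is a minimal plain-Scala sketch of the same map-then-reduce idea applied to single rows, with no Spark involved. The object and the helper name positiveCount are made up for illustration; the two rows are copied from the question's table without the keys column.

```scala
// Plain-Scala illustration (no Spark): count the positive values in one row.
object PositiveCountSketch {
  // Hypothetical helper, not a Spark function: maps each value to 1 or 0,
  // then sums the results, mirroring when(...).otherwise(0) and reduce(_ + _).
  def positiveCount(values: Seq[Int]): Int =
    values.map(v => if (v > 0) 1 else 0).reduce(_ + _)

  // Rows for key 3 and key 4 from the question (num_1 .. num_9).
  val row3: Seq[Int] = Seq(0, 134, 0, 0, 44, 332, 0, 423, 111)
  val row4: Seq[Int] = Seq(0, 338, 0, 0, 0, 0, 0, 0, 0)

  def main(args: Array[String]): Unit = {
    println(positiveCount(row3) >= 5) // key 3 passes the threshold
    println(positiveCount(row4) >= 5) // key 4 does not
  }
}
```

In the DataFrame version the same arithmetic happens on Column expressions, so Spark evaluates it per row as part of the query.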

df.filter(df.columns.tail.map(x => when(col(x) > 0, 1).otherwise(0)).reduce(_ + _) >= 5).show

+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|keys|num_1|num_2|num_3|num_4|num_5|num_6|num_7|num_8|num_9|
+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|   3|    0|  134|    0|    0|   44|  332|    0|  423|  111|
|   7|    0|  130|    4|   11|    0|    5| 1222|    0|    0|
+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
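For anyone who wants to try this end to end, the following is a rough self-contained sketch rather than a definitive implementation. It assumes a local SparkSession and rebuilds the table from the question. The names FilterByPositiveCount, filterByPositiveCount, sample, and the minPositive parameter are invented for this example. It also picks the value columns by the num_ prefix instead of df.columns.tail, which is an assumption about the naming, and uses foldLeft starting from lit(0) in place of reduce.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

object FilterByPositiveCount {
  // Hypothetical helper (not a Spark API): keeps the rows in which at least
  // `minPositive` of the columns named in `valueCols` hold a value > 0.
  def filterByPositiveCount(df: DataFrame, valueCols: Seq[String], minPositive: Int): DataFrame = {
    val positives: Column = valueCols
      .map(c => when(col(c) > 0, 1).otherwise(0))
      .foldLeft(lit(0))(_ + _) // unlike reduce, this also works for an empty column list
    df.filter(positives >= minPositive)
  }

  // Rebuilds the table shown in the question.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
      (2, 0, 0, 0, 0, 0, 0, 0, 0, 0),
      (3, 0, 134, 0, 0, 44, 332, 0, 423, 111),
      (4, 0, 338, 0, 0, 0, 0, 0, 0, 0),
      (5, 0, 0, 0, 0, 0, 0, 0, 0, 0),
      (6, 0, 0, 0, 0, 0, 0, 0, 0, 0),
      (7, 0, 130, 4, 11, 0, 5, 1222, 0, 0),
      (8, 0, 1, 0, 0, 0, 0, 0, 0, 2)
    ).toDF("keys", "num_1", "num_2", "num_3", "num_4", "num_5",
      "num_6", "num_7", "num_8", "num_9")
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only; the master and app name are placeholders.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("positive-count-filter")
      .getOrCreate()

    val df = sample(spark)
    // Select value columns by name so the result does not depend on column order.
    val valueCols = df.columns.filter(_.startsWith("num_")).toSeq
    filterByPositiveCount(df, valueCols, minPositive = 5).show()

    spark.stop()
  }
}
```

Selecting columns by prefix keeps the filter correct if the keys column is not first, and reduce would throw on an empty column list whereas foldLeft simply yields a count of 0.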

scala

apache-spark

spark-dataframe