How to count the number of elements with the same sign?

Question

How can I count the runs of consecutive positive or negative elements in each row?

Suppose I have the following data:

ski     2020    2021    2022    2023    2024    2025
book     1.2     5.6     8.4      -2      -5       6
jar      4.2      -5      -8       2       4       6
kook      -4    -5.2    -2.3    -5.6      -7       8

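The output is one list per row that counts consecutive elements with the same sign. For example, the first row (book) has 3 positive elements, then 2 negative elements, then one positive element, so the output is [3,-2,1]. For the other 2 rows, the output is as follows: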

 jar   [1,-2,3]
 kook   [-5,1]

Answer 1

You can use a user-defined function built on Python's itertools.groupby (lambda x: (1, -1)[x<0] is the sign function):

df.show()
# +----+------+------+------+------+----+----+                                    
# |   0|     1|     2|     3|     4|   5|   6|
# +----+------+------+------+------+----+----+
# | ski|2020.0|2021.0|2022.0|2023.0|2024|2025|
# |book|   1.2|   5.6|   8.4|  -2.0|  -5|   6|
# | jar|   4.2|  -5.0|  -8.0|   2.0|   4|   6|
# |kook|  -4.0|  -5.2|  -2.3|  -5.6|  -7|   8|
# +----+------+------+------+------+----+----+

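from pyspark.sql.functions import udf, array
from itertools import groupby
from pyspark.sql.types import IntegerType, ArrayType

def count_signs(l):
    # Map each value to its sign (+1 or -1; zero counts as positive),
    # group consecutive equal signs, and emit sign * run length.
    return [(s*len(list(g))) for s, g in groupby(map(lambda x: (1, -1)[x<0], l))]

# Wrap the function as a UDF that returns an array of integers.
count_signs_udf = udf(count_signs, ArrayType(IntegerType()))

# Pack every column except the first (the label) into one array column and apply the UDF.
df.withColumn('signs', count_signs_udf(array(df.columns[1:]))).show()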
# +----+------+------+------+------+----+----+----------+
# |   0|     1|     2|     3|     4|   5|   6|     signs|
# +----+------+------+------+------+----+----+----------+
# | ski|2020.0|2021.0|2022.0|2023.0|2024|2025|       [6]|
# |book|   1.2|   5.6|   8.4|  -2.0|  -5|   6|[3, -2, 1]|
# | jar|   4.2|  -5.0|  -8.0|   2.0|   4|   6|[1, -2, 3]|
# |kook|  -4.0|  -5.2|  -2.3|  -5.6|  -7|   8|   [-5, 1]|
# +----+------+------+------+------+----+----+----------+
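
To check the grouping logic without a Spark session, here is a minimal plain-Python sketch of the same idea. The `rows` dictionary and the `results` name are only illustrative (they are not part of the answer above); the values are copied from the question's sample data.

```python
from itertools import groupby

def count_signs(values):
    # Map each value to +1 or -1 (zero is treated as positive, as in the lambda above),
    # group consecutive equal signs, and emit sign * run length.
    signs = (-1 if x < 0 else 1 for x in values)
    return [s * len(list(g)) for s, g in groupby(signs)]

# Sample rows from the question (hypothetical container, for illustration only).
rows = {
    "book": [1.2, 5.6, 8.4, -2, -5, 6],
    "jar": [4.2, -5, -8, 2, 4, 6],
    "kook": [-4, -5.2, -2.3, -5.6, -7, 8],
}

results = {name: count_signs(vals) for name, vals in rows.items()}
for name, runs in results.items():
    print(name, runs)
# book [3, -2, 1]
# jar [1, -2, 3]
# kook [-5, 1]
```

Note that groupby only merges adjacent equal keys, which is what makes it count runs rather than totals. If zeros should form their own group instead of counting as positive, the sign mapping would need a third case.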

python

dataframe

pyspark