Creating combination and sum of value lists with existing key - Pyspark

Question

My problem is similar to the one given, but I have an additional field that I want to sum over. That is, my RDD looks as follows (shown here as a DataFrame):

+----------+----------------+----------------+
|    c1    |        c2      |      val       |
+----------+----------------+----------------+
|        t1|         [a, b] |        [11, 12]|
|        t2|     [a, b, c ] |    [13, 14, 15]|
|        t3|   [a, b, c, d] |[16, 17, 18, 19]|
+----------+----------------+----------------+

I want to get something like this:

        +----------+----------------+----------------+
        |    c1    |        c2      |     sum(val)   |
        +----------+----------------+----------------+
        |        t1|         [a, b] |        23      |
        |        t2|         [a, b] |        27      |
        |        t2|         [a, c] |        28      |
        |        t2|         [b, c] |        29      |
        |        t3|         [a, b] |        33      |
        |        t3|         [a, c] |        34      |
        |        t3|         [a, d] |        35      |
        |        t3|         [b, c] |        35      |
        |        t3|         [b, d] |        36      |
        |        t3|         [c, d] |        37      |
        +----------+----------------+----------------+

With the following code I get the first two columns:

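import itertools

def combinations(row):
    l = row[1]  # labels (column c2)
    k = row[0]  # key (column c1)
    # Emit one (key, pair) tuple per 2-element combination of the labels
    return [(k, v) for v in itertools.combinations(l, 2)]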

a.map(combinations).flatMap(lambda x: x).take(5)

With this code I tried to get the third column as well, but I end up with more rows than expected:

def combinations(row):
    l = row[1]  # labels (column c2)
    k = row[0]  # key (column c1)
    m = row[2]  # values (column val)
    return [(k, v, x) for v in itertools.combinations(l, 2) for x in map(sum, itertools.combinations(m, 2))]
        
a.map(combinations).flatMap(lambda x: x).take(5)

Any help would be appreciated.

Answer 1

Try the following:

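import itertools

# Assumes an existing SparkContext named sc
a = sc.parallelize([
    (1, [1, 2, 3, 4], [11, 12, 13, 14]),
    (2, [3, 4, 5, 6], [15, 16, 17, 18]),
    (3, [-1, 2, 3, 4], [19, 20, 21, 22]),
])

def combinations(row):
    l = row[1]
    k = row[0]
    m = row[2]
    # The nested comprehension pairs every 2-combination of l with every pairwise sum of m
    return [(k, v, x) for v in itertools.combinations(l, 2) for x in map(sum, itertools.combinations(m, 2))]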

a.map(combinations).flatMap(lambda x: x).take(5)
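
To check how many tuples the nested comprehension produces for one row, it can be run on a plain tuple without Spark. This is only a sketch: `nested_rows` is a hypothetical name, and the sample row is the `t2` row from the question.

```python
import itertools

def nested_rows(row):
    k, l, m = row
    # Same nested comprehension as above: every pair is combined with every sum.
    return [(k, v, x)
            for v in itertools.combinations(l, 2)
            for x in map(sum, itertools.combinations(m, 2))]

row = ("t2", ["a", "b", "c"], [13, 14, 15])
n_pairs = len(list(itertools.combinations(row[1], 2)))  # number of label pairs
out = nested_rows(row)
print(n_pairs, len(out))
```

For a list of three labels there are 3 pairs, but the comprehension returns 3 x 3 = 9 tuples, because each pair is repeated once per sum.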

Answer 2

I solved it as follows:

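import itertools

def combinations(row):
    l = row[1]  # labels (column c2)
    k = row[0]  # key (column c1)
    m = row[2]  # values (column val), same length as l
    # For each pair of labels, look up each label's position in l and add the matching values from m.
    # list.index returns the first match, so this relies on the labels in l being distinct.
    return [(k, v, m[l.index(v[0])] + m[l.index(v[1])]) for v in itertools.combinations(l, 2)]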

a.map(combinations).flatMap(lambda x: x).take(5)

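Since the second and third columns have the same number of elements, I look up the matching elements by position and add them. Thanks to Lavesh for the answer, which helped me find the solution.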
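
A possible variant, not taken from the answers above, is to zip the labels with their values before building the combinations, so no index lookup is needed and repeated labels would not collide. This is a minimal sketch in plain Python; `pair_sums` and the sample rows are assumptions chosen to mirror the question's table.

```python
import itertools

def pair_sums(row):
    """Return (key, label_pair, value_sum) for every 2-combination in a row."""
    k, labels, values = row
    # Pair each label with its value first, then combine the pairs, so
    # labels and values stay aligned without calling list.index.
    return [
        (k, (la, lb), va + vb)
        for (la, va), (lb, vb) in itertools.combinations(zip(labels, values), 2)
    ]

rows = [
    ("t1", ["a", "b"], [11, 12]),
    ("t2", ["a", "b", "c"], [13, 14, 15]),
]
result = [t for row in rows for t in pair_sums(row)]
print(result)
```

On an RDD shaped like the one in the question, the same function could presumably be applied with `a.flatMap(pair_sums)`, which replaces the `map` followed by `flatMap(lambda x: x)`.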
