Creating combinations and sums of value lists with an existing key - PySpark
My question is similar to the one given, but I have an additional field whose values I want to sum. That is, my RDD looks like this (shown here as a DataFrame):
+----------+----------------+----------------+
| c1 | c2 | val |
+----------+----------------+----------------+
| t1| [a, b] | [11, 12]|
| t2| [a, b, c ] | [13, 14, 15]|
| t3| [a, b, c, d] |[16, 17, 18, 19]|
+----------+----------------+----------------+
I want to get something like this:
+----------+----------------+----------------+
| c1 | c2 | sum(val) |
+----------+----------------+----------------+
| t1| [a, b] | 23 |
| t2| [a, b] | 27 |
| t2| [a, c] | 28 |
| t2| [b, c] | 29 |
| t3| [a, b] | 33 |
| t3| [a, c] | 34 |
| t3| [a, d] | 35 |
| t3| [b, c] | 35 |
| t3| [b, d] | 36 |
| t3| [c, d] | 37 |
+----------+----------------+----------------+
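With the following code I get the first two columns:
import itertools

def combinations(row):
    l = row[1]  # list of labels (c2)
    k = row[0]  # key (c1)
    m = row[2]  # list of values (val), not used yet
    return [(k, v) for v in itertools.combinations(l, 2)]

a.map(combinations).flatMap(lambda x: x).take(5)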
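With this code I tried to get the third column, but I get more rows than expected:
def combinations(row):
    l = row[1]
    k = row[0]
    m = row[2]
    # the two nested loops pair every label combination with every sum
    return [(k, v, x) for v in itertools.combinations(l, 2) for x in map(sum, itertools.combinations(m, 2))]

a.map(combinations).flatMap(lambda x: x).take(5)
Any help would be appreciated.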
Try the following:
import itertools

a = sc.parallelize([
    (1, [1, 2, 3, 4], [11, 12, 13, 14]),
    (2, [3, 4, 5, 6], [15, 16, 17, 18]),
    (3, [-1, 2, 3, 4], [19, 20, 21, 22])
])

def combinations(row):
    l = row[1]
    k = row[0]
    m = row[2]
    return [(k, v, x) for v in itertools.combinations(l, 2) for x in map(sum, itertools.combinations(m, 2))]

a.map(combinations).flatMap(lambda x: x).take(5)
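For what it's worth, the nested comprehension above pairs each label combination with every sum, which would explain the extra rows: a row with n labels yields C(n, 2) squared tuples instead of C(n, 2). A minimal plain-Python sketch of that count, with no Spark involved (the name `cross_product_rows` is made up for illustration):

```python
import itertools

def cross_product_rows(row):
    # Same comprehension as above: two independent loops, so every
    # label pair is combined with every pairwise sum.
    k, l, m = row
    return [(k, v, x)
            for v in itertools.combinations(l, 2)
            for x in map(sum, itertools.combinations(m, 2))]

row = (1, [1, 2, 3, 4], [11, 12, 13, 14])
produced = cross_product_rows(row)
expected_pairs = list(itertools.combinations(row[1], 2))

# 6 label pairs are wanted, but 6 * 6 = 36 tuples come out.
print(len(expected_pairs), len(produced))
```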
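Solved as follows:
def combinations(row):
    l = row[1]
    k = row[0]
    m = row[2]
    # look up each label's position in l and add the values at those positions in m
    return [(k, v, m[l.index(v[0])] + m[l.index(v[1])]) for v in itertools.combinations(l, 2)]

a.map(combinations).flatMap(lambda x: x).take(5)
Since the second and third columns have the same number of elements, I look up the position of each element in a pair and add the corresponding values. Thanks to Lavesh for the answer, which helped me find the solution.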
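One caveat with the `l.index(...)` lookup: it returns the first matching position, so it would pick the wrong value if a label appeared twice in the same list. A possible variant, sketched here under the assumption that both lists always have the same length, combines positions instead of labels (`pair_sums` is a hypothetical name, and the sample rows are the ones from the question):

```python
import itertools

def pair_sums(row):
    """Return (key, label_pair, value_sum) for every 2-combination of positions."""
    key, labels, values = row
    return [
        (key, (labels[i], labels[j]), values[i] + values[j])
        for i, j in itertools.combinations(range(len(labels)), 2)
    ]

# Plain-Python check on the question's sample rows (no Spark needed).
rows = [
    ("t1", ["a", "b"], [11, 12]),
    ("t2", ["a", "b", "c"], [13, 14, 15]),
    ("t3", ["a", "b", "c", "d"], [16, 17, 18, 19]),
]
result = [item for row in rows for item in pair_sums(row)]
for item in result:
    print(item)
```

On an RDD this should be usable as `a.flatMap(pair_sums)`, which also replaces the `map(...)` followed by `flatMap(lambda x: x)`.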