How do I perform this type of Cartesian product in Spark 2.01?

Question

// I am using Spark 2.01 //

My data looks like this:

(K1,Array(V1,V2,V3.....V30))
(K2,Array(V1,V2,V3.....V30))
(K3,Array(V1,V2,V3.....V30))
...
(K3704, Array(V1,V2,V3.....V30))

For each key, I want to create a list of the Cartesian product of that key's values.

(K1, (V1,V2),(V1,V3),(V1,V4) ...
(K2, (V2,V3),(V2,V4),(V2,V5) ...
...
// PS. There should be no duplicate elements: (V1,V2) and (V2,V1) count as the same pair.

I also think this would take 30! operations per key, so it would be even better if it could be optimized.

Answer 1

In Python, we can use the combinations() function from the itertools package inside mapValues():

from itertools import combinations
rdd.mapValues(lambda x: list(combinations(x, 2)))
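
To see what the function passed to mapValues() produces for a single record, here is a minimal sketch in plain Python, without Spark. The sample record and the helper name pairs_per_key are hypothetical and only for illustration; the record is shaped like the (key, values) data in the question.

```python
from itertools import combinations

# Hypothetical sample record shaped like the question's data: (key, values)
record = ("K1", ["V1", "V2", "V3", "V4"])

def pairs_per_key(values):
    """Return every unordered pair of elements taken from `values`."""
    # combinations() emits each pair once, in input order,
    # so (V1, V2) appears but (V2, V1) does not.
    return list(combinations(values, 2))

key, values = record
pairs = pairs_per_key(values)
print(key, pairs)
```

With an RDD of such records, rdd.mapValues(pairs_per_key) should behave the same as the lambda above.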

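In Scala, we can use the combinations() method in a similar way. But because it only takes and returns objects of type Seq, we have to chain a few more methods together to get the format you expect: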

rdd.mapValues(_.toSeq.combinations(2).toArray.map{case Seq(x,y) => (x,y)})
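
On the cost concern raised in the question: the number of unordered pairs drawn from n values is n(n-1)/2, not n!. The sketch below checks this for n = 30 in plain Python. The key count of 3704 is an assumption read off the sample data (K1 through K3704).

```python
from itertools import combinations

n = 30  # number of values per key, as in the question

# Count the pairs by enumerating them, then compare with the closed form.
pair_count = sum(1 for _ in combinations(range(n), 2))
closed_form = n * (n - 1) // 2

# Assumption: 3704 keys, taken from the last key shown in the sample data.
num_keys = 3704
total_pairs = num_keys * closed_form

print(pair_count, closed_form, total_pairs)
```

So each key yields 435 pairs, which is small enough to build inside a single mapValues() call.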
