Combine two RDDs in pyspark

Suppose I have the following RDDs:

a = sc.parallelize([1, 2, 5, 3])
b = sc.parallelize(['a','c','d','e'])

How can I combine these two RDDs into one RDD like this:

[('a', 1), ('c', 2), ('d', 5), ('e', 3)]

Using a.union(b) just combines them into a single list. Any ideas?

You probably just want to b.zip(a) the two RDDs (note the reversed order, since you want to key by the values of b).
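For the RDDs in the question, that produces exactly the desired output (a minimal sketch, assuming the a and b defined above):

b.zip(a).collect()
[('a', 1), ('c', 2), ('d', 5), ('e', 3)]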

Do read the python docs carefully:

zip(other)

Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc. Assumes that the two RDDs have the same number of partitions and the same number of elements in each partition (e.g. one was made through a map on the other).

x = sc.parallelize(range(0,5))
y = sc.parallelize(range(1000, 1005))
x.zip(y).collect()
[(0, 1000), (1, 1001), (2, 1002), (3, 1003), (4, 1004)]
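If the two RDDs were built independently and you cannot guarantee matching partition counts and per-partition sizes, a common fallback (not from the original answer, just a sketch) is to index both RDDs with zipWithIndex, key each element by its index, and join:

# Build (index, value) pairs for each RDD
bi = b.zipWithIndex().map(lambda kv: (kv[1], kv[0]))
ai = a.zipWithIndex().map(lambda kv: (kv[1], kv[0]))
# Join on the index, restore the original order, then drop the index
bi.join(ai).sortByKey().values().collect()
[('a', 1), ('c', 2), ('d', 5), ('e', 3)]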