PySpark: Working with 2 RDDs, element-wise comparison
Suppose I have two RDDs that I want to compare element-wise:
data1 = [1,2,3]
rdd1 = spark.sparkContext.parallelize(data1)
data2 = [7,8,9]
rdd2 = spark.sparkContext.parallelize(data2)
What is the best way to multiply them element-wise so that I end up with the following array?
rdd3 = [[7,8,9], [14,16,18], [21,24,27]]
I have a feeling this is a join operation, but I am not sure how to set up the key-value pairs.
Try cartesian, like this:
data1 = [1,2,3]
rdd1 = spark.sparkContext.parallelize(data1)
data2 = [[7,8,9]]
rdd2 = spark.sparkContext.parallelize(data2)
rdd1.cartesian(rdd2).map(lambda x: [x[0]*i for i in x[1]]).collect()
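Note that data2 here is wrapped in an extra pair of brackets, so rdd2 is a one-element RDD whose single element is the whole list [7,8,9]. To see what cartesian produces before the map, collecting the intermediate RDD should give something like:
rdd1.cartesian(rdd2).collect()
# [(1, [7, 8, 9]), (2, [7, 8, 9]), (3, [7, 8, 9])]  (pair order may vary with partitioning)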
You can take the cartesian join of the RDDs and then reduce them to get the lists.
Note: Spark is a distributed processing engine, and reduceByKey can return the final list in any order. If you want strong ordering guarantees, enrich your RDDs to include an index element (see the sketch after the output below).
data1 = [1,2,3]
rdd1 = spark.sparkContext.parallelize(data1)
data2 = [7,8,9]
rdd2 = spark.sparkContext.parallelize(data2)
rdd1.cartesian(rdd2)\
.map(lambda x: (x[0], [x[0] * x[1]]))\
.reduceByKey(lambda x, y: x + y)\
.map(lambda x: x[1]).collect()
Output
[[7, 8, 9], [14, 16, 18], [21, 24, 27]]
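One way to add that index, as a minimal sketch (the names idx1 and idx2 are my own, not from the answer above), is to tag both RDDs with positional indices via zipWithIndex, carry the indices through the cartesian/reduceByKey steps, and sort on them before dropping them:
# Sketch only: zipWithIndex gives (value, index); swap to (index, value) for keying.
idx1 = rdd1.zipWithIndex().map(lambda x: (x[1], x[0]))   # (row_index, value)
idx2 = rdd2.zipWithIndex().map(lambda x: (x[1], x[0]))   # (col_index, value)
idx1.cartesian(idx2)\
    .map(lambda p: (p[0][0], [(p[1][0], p[0][1] * p[1][1])]))\
    .reduceByKey(lambda a, b: a + b)\
    .sortByKey()\
    .map(lambda kv: [v for _, v in sorted(kv[1])])\
    .collect()
# Expected: [[7, 8, 9], [14, 16, 18], [21, 24, 27]], with both the row order and the
# order within each row fixed by the indices rather than by partitioning.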