Combining two different lines in Python Spark RDD
I ran into a small problem while working with a Python Spark RDD. My RDD looks like:
old_rdd = [(A1, Vector(V1)), (A2, Vector(V2)), (A3, Vector(V3)), ...]
I want to use flatMap so that I get a new RDD like:
new_rdd = [((A1, A2), (V1, V2)), ((A1, A3), (V1, V3))] and so on.
The problem is that flatMap strips the tuple nesting, producing something like [(A1, V1, A2, V2), ...]. Do you have any suggestions for an alternative to flatMap()? Thank you in advance.
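For context, here is a minimal sketch of the flattening behaviour described above (assuming a local SparkContext named sc and toy values; the exact flatMap call is not shown in the question, so this is only a guess at what went wrong):

pairs = sc.parallelize([("A1", [1, -2]), ("A2", [5, 7])])
# flatMap treats each returned tuple as an iterable and emits its
# elements as separate records, so the pair structure is lost:
pairs.flatMap(lambda kv: kv).collect()
# ['A1', [1, -2], 'A2', [5, 7]]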
This is related to Explicit sort in Cartesian transformation in Scala Spark. However, I will assume that you have already cleaned the RDD of duplicates, and that the ids follow some simple pattern that can be parsed and then compared. For simplicity, I will work with lists instead of Vectors:
old_rdd = sc.parallelize([(1, [1, -2]), (2, [5, 7]), (3, [8, 23]), (4, [-1, 90])])
# cartesian() yields all ordered pairs (permutations); combinations are a
# subset of the permutations, so we filter on the ids to keep each pair once.
# (Python 3 lambdas cannot unpack tuples, so index into the pair instead.)
combined_rdd = old_rdd.cartesian(old_rdd)
combinations = combined_rdd.filter(lambda pair: pair[0][0] < pair[1][0])
combinations.collect()
# The output will be...
# -----------------------------
# [((1, [1, -2]), (2, [5, 7])),
# ((1, [1, -2]), (3, [8, 23])),
# ((1, [1, -2]), (4, [-1, 90])),
# ((2, [5, 7]), (3, [8, 23])),
# ((2, [5, 7]), (4, [-1, 90])),
# ((3, [8, 23]), (4, [-1, 90]))]
# Now reshape each pair into the desired ((id1, id2), (values1, values2)) form.
combinations = combinations.map(lambda pair: ((pair[0][0], pair[1][0]), (pair[0][1], pair[1][1]))).collect()
# The output will be...
# ----------------------
# [((1, 2), ([1, -2], [5, 7])),
# ((1, 3), ([1, -2], [8, 23])),
# ((1, 4), ([1, -2], [-1, 90])),
# ((2, 3), ([5, 7], [8, 23])),
# ((2, 4), ([5, 7], [-1, 90])),
# ((3, 4), ([8, 23], [-1, 90]))]
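The same approach should carry over to the Vectors from the original question. A minimal sketch, assuming pyspark.mllib.linalg dense vectors and string ids such as A1 (which compare lexicographically, so the < filter still works):

from pyspark.mllib.linalg import Vectors

old_rdd = sc.parallelize([("A1", Vectors.dense([1, -2])),
                          ("A2", Vectors.dense([5, 7])),
                          ("A3", Vectors.dense([8, 23]))])
pairs = (old_rdd.cartesian(old_rdd)
                .filter(lambda p: p[0][0] < p[1][0])
                .map(lambda p: ((p[0][0], p[1][0]), (p[0][1], p[1][1]))))
# pairs.collect() yields entries such as
# (('A1', 'A2'), (DenseVector([1.0, -2.0]), DenseVector([5.0, 7.0])))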