Using parallelize to create a key/value pair RDD?
The Spark API docs give the following definition for creating an RDD with parallelize:
parallelize(c, numSlices=None)
Distribute a local Python collection to form an RDD. Using xrange is
recommended if the input represents a range for performance.
>>> sc.parallelize([0, 2, 3, 4, 6], 5).glom().collect()
[[0], [2], [3], [4], [6]]
>>> sc.parallelize(xrange(0, 6, 2), 5).glom().collect()
[[], [0], [], [2], [4]]
I want to create a key/value pair RDD. How can I do that with parallelize? Example output RDD:
key | value
-------+-------
panda | 0
pink | 3
pirate | 3
panda | 1
pink | 4
sc.parallelize([("panda", 0), ("pink", 3)])
sc.parallelize(Seq(("panda", 0), ("pink", 3)))
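For what it's worth, a minimal sketch of the first form, assuming PySpark: a plain Python list of (key, value) tuples passed to parallelize already behaves as a pair RDD (the second form uses Scala's Seq, which does not exist in Python). The app name below is hypothetical; in the pyspark shell, sc is already provided:

from pyspark import SparkContext

# Hypothetical local context; in the pyspark shell, sc already exists.
sc = SparkContext("local", "pair-rdd-sketch")

# A list of (key, value) tuples becomes a pair RDD when parallelized.
pairs = sc.parallelize([("panda", 0), ("pink", 3), ("pirate", 3),
                        ("panda", 1), ("pink", 4)])

# Pair operations such as reduceByKey treat the first tuple element as the key.
print(pairs.reduceByKey(lambda a, b: a + b).collect())
# e.g. [('panda', 1), ('pink', 7), ('pirate', 3)]; ordering may vary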