How to see the contents of each partition in an RDD in pyspark?
I want to understand more about how pyspark partitions data. I need a function like this:
a = sc.parallelize(range(10), 5)
show_partitions(a)
#output:[[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]] (or however it partitions)
The glom function is what you're looking for:
glom(self): Return an RDD created by coalescing all elements within each partition into a list.
a = sc.parallelize(range(10), 5)
a.glom().collect()
#output:[[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
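If you want the show_partitions helper from the question, a minimal sketch (show_partitions is the hypothetical name from the question, not a Spark API) can simply wrap glom:

def show_partitions(rdd):
    # glom() turns each partition into a list; collect() brings them to the driver.
    print(rdd.glom().collect())

a = sc.parallelize(range(10), 5)
show_partitions(a)
# [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]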
The same approach works in Scala:

val data = List((1,3), (1,2), (1,4), (2,3), (3,6), (3,8))
val rdd = sc.parallelize(data)

// Print each partition's elements, separated by a marker line.
rdd.glom().collect().foreach { partition =>
  partition.foreach(println)
  println("=====")
}
This way, you can see how the data is partitioned.
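Back in pyspark, if you also want to know which partition holds which elements, a small sketch (assuming the usual SparkContext sc) can use mapPartitionsWithIndex to pair each partition's index with its contents:

a = sc.parallelize(range(10), 5)
# The function receives the partition index and an iterator over that partition;
# returning a one-element list yields a single (index, contents) pair per partition.
indexed = a.mapPartitionsWithIndex(lambda i, it: [(i, list(it))])
print(indexed.collect())
# e.g. [(0, [0, 1]), (1, [2, 3]), (2, [4, 5]), (3, [6, 7]), (4, [8, 9])]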