rdd of DataFrame could not change partition number in Spark Structured Streaming python
I'm using the following PySpark code in Spark Structured Streaming to fetch a DataFrame from Redis:
def process(stream_batch, batch_id):
    stream_batch.persist()
    length = stream_batch.count()
    # b_to_ndarray is a single-threaded helper that converts the bytes
    # stored in Redis into an ndarray
    record_rdd = stream_batch.rdd.map(lambda x: b_to_ndarray(x['data']))
    record_rdd = record_rdd.coalesce(4)   # does not work
    print(record_rdd.getNumPartitions())  # output: 1
    # some other code
Why? How can I solve it? The code in main is:
loadedDf = spark.readStream.format('redis')...
query = loadedDf.writeStream \
    .foreachBatch(process) \
    .start()
query.awaitTermination()
Since the partition count is 1 to begin with, and coalesce can only reduce the number of partitions, never increase it, the RDD stays at 1 partition no matter what number you pass in. Use repartition instead: it performs a full shuffle and can produce more partitions than the input has.