The RDD of a DataFrame cannot change its partition number in Spark Structured Streaming (Python)

I am using the following pyspark code in Spark Structured Streaming to get a DataFrame from Redis:

def process(stream_batch, batch_id):
    stream_batch.persist()
    length = stream_batch.count()

    # b_to_ndarray is a single-threaded method that converts bytes stored in Redis to an ndarray
    record_rdd = stream_batch.rdd.map(lambda x: b_to_ndarray(x['data']))

    record_rdd = record_rdd.coalesce(4)  # does not work

    print(record_rdd.getNumPartitions())  # outputs 1
    # some other code

Why does this happen, and how can I fix it? The code in main is:

loadedDf = spark.readStream.format('redis')...
query = loadedDf.writeStream \
    .foreachBatch(process).start()
query.awaitTermination()

Since the partition number is 1 to begin with, and `coalesce` is not allowed to produce *more* partitions than the RDD already has, the result stays at 1 partition no matter what value you pass. Use `repartition` instead: it performs a full shuffle and can increase the partition count.