Kafka Partition + Spark Streaming Context

Scenario: I have one topic with two partitions that hold different data sets, say A and B. I know a DStream can consume messages at the partition level as well as at the topic level.

Query: should we use two different streaming contexts, one per partition, or a single streaming context for the whole topic and then filter the data at the partition level? I am worried about the performance cost of increasing the number of streaming contexts.

Quoting the documentation:

Simplified Parallelism: No need to create multiple input Kafka streams and union them. With directStream, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume, which will all read data from Kafka in parallel. So there is a one-to-one mapping between Kafka and RDD partitions, which is easier to understand and tune.

So, if you use the Direct Stream based Spark Streaming consumer, it should handle the parallelism for you.
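To make this concrete, here is a minimal sketch using the spark-streaming-kafka-0-10 connector. The broker address, group id, topic name ("my-topic"), and batch interval are placeholder assumptions. It uses a single StreamingContext with one direct stream over the whole topic, then splits the A/B data sets by filtering on the Kafka partition id:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaDirectExample {
  def main(args: Array[String]): Unit = {
    // One StreamingContext for the whole application
    val conf = new SparkConf().setAppName("kafka-direct-example")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // Hypothetical broker address and consumer group
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "example-group",
      "auto.offset.reset"  -> "latest"
    )

    // A single direct stream over the whole topic; Spark Streaming
    // creates one RDD partition per Kafka partition and reads them
    // in parallel, as the quoted documentation describes.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Array("my-topic"), kafkaParams)
    )

    // Keep the Kafka partition id alongside each value, then split
    // the A and B data sets by filtering; no second context needed.
    val tagged = stream.map(r => (r.partition, r.value))
    val aSide  = tagged.filter(_._1 == 0).map(_._2)
    val bSide  = tagged.filter(_._1 == 1).map(_._2)

    aSide.print()
    bSide.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note also that only one StreamingContext can be active in a JVM at a time, so filtering within a single direct stream, rather than creating a context per partition, is the idiomatic approach here.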