Should I add a buffer after a Kafka source in Akka Streams?

According to this blog post:

If the source of the stream polls an external entity for new messages and the downstream processing is non-uniform, inserting a buffer can be crucial to realizing good throughput. For example, a large buffer inserted after the Kafka Consumer from the Reactive Streams Kafka library can improve performance by an order of magnitude in some situations. Otherwise, the source may not poll Kafka fast enough to keep the downstream saturated with work, with the source oscillating between backpressuring and polling Kafka.

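The documentation for the Alpakka Kafka connector doesn't mention this, so I was wondering whether it makes sense to use a buffer in this case. Does the same apply to Kafka sinks (should I add a buffer before them)?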

...I was wondering if it makes sense to use a buffer in this case

Consider the following fragment from the blog post that you quoted:

...the downstream processing is non-uniform....

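One of the main points of that section of the post is to show that user-defined buffers and asynchronous boundaries have similar effects on a stream. The default behavior, without buffers or asynchronous boundaries, is operator fusion, which runs the stream in a single actor. This essentially means that each consumed Kafka message must travel through the stream's entire pipeline, from source to sink, before the next message enters the pipeline. In other words, a message m2 will not go through the pipeline until the preceding message m1 has finished processing.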

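If the processing that happens downstream from the Kafka connector source is "non-uniform" (that is, it can take a varying amount of time: sometimes it finishes quickly, sometimes it takes a while), then introducing a buffer or an asynchronous boundary can improve overall throughput. This is because a buffer or asynchronous boundary allows the source to keep consuming Kafka messages even when the downstream processing happens to take a long time. That is, if m1 takes a long time to process, the source can consume messages m2, m3, and so on (until the buffer is full) without waiting for m1 to finish. As Colin Breck says in his post: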

The buffer improves performance by decoupling stages, allowing the upstream or downstream to continue to process elements, on average, even if one of them is busy processing a relatively expensive workload.
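As a rough illustration of where such a buffer would sit, here is a minimal sketch that assumes the Alpakka Kafka scaladsl on a recent Akka 2.6.x. The broker address, group id, topic name, buffer size, and the process function are all hypothetical placeholders, not values taken from the question or the blog post.

```scala
import akka.actor.ActorSystem
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.kafka.scaladsl.Consumer
import akka.stream.OverflowStrategy
import akka.stream.scaladsl.Sink
import org.apache.kafka.common.serialization.StringDeserializer

object BufferedKafkaSource extends App {
  implicit val system: ActorSystem = ActorSystem("buffered-kafka-source")

  // Hypothetical connection details; adjust to your environment.
  val settings =
    ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
      .withBootstrapServers("localhost:9092")
      .withGroupId("example-group")

  // Placeholder for downstream work whose duration varies per message.
  def process(value: String): String = value.toUpperCase

  Consumer
    .plainSource(settings, Subscriptions.topics("example-topic"))
    // Hold up to 1000 records; backpressure the source once the buffer is full.
    .buffer(1000, OverflowStrategy.backpressure)
    .map(record => process(record.value))
    .runWith(Sink.ignore)
}
```

The buffer size of 1000 is an arbitrary starting point; a larger buffer holds more records in memory, so it is worth tuning against measurements rather than guessing.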

This potential performance boost does not apply in every situation. To quote Breck again:

Similar to the async method discussed in the previous section, it should be noted that inserting buffers indiscriminately will not improve performance and simply consume additional resources. If adjacent workloads are relatively uniform, the addition of a buffer will not change the performance, as the overall performance of the stream will simply be dominated by the slowest processing stage.

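One obvious way to determine whether using a buffer (i.e., .buffer) makes sense in your case is to try it. You could also try adding an asynchronous boundary (i.e., .async). Compare the following three approaches and see which one performs best: (1) the default fused behavior without a buffer, (2) .buffer, and (3) .async.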
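A comparison like this could be set up along the following lines. This is only a toy harness, assuming Akka 2.6.x: the source and the work flow are hypothetical stand-ins that simulate fetch cost and non-uniform processing with Thread.sleep, and the element count, sleep durations, and buffer size are arbitrary. The timings it prints say nothing about a real Kafka workload; in practice you would substitute your actual source and processing stages.

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.OverflowStrategy
import akka.stream.scaladsl.{Flow, Sink, Source}
import scala.concurrent.Await
import scala.concurrent.duration._

object BufferComparison extends App {
  implicit val system: ActorSystem = ActorSystem("buffer-comparison")

  // Stand-in for a polling source: each element costs a little to fetch.
  def source: Source[Int, NotUsed] =
    Source(1 to 200).map { i => Thread.sleep(2); i }

  // Stand-in for non-uniform downstream work: every tenth element is slow.
  val work: Flow[Int, Int, NotUsed] =
    Flow[Int].map { i => Thread.sleep(if (i % 10 == 0) 20 else 1); i }

  // Runs the stream to completion and prints the elapsed wall-clock time.
  def time(label: String)(stream: Source[Int, NotUsed]): Unit = {
    val start = System.nanoTime()
    Await.result(stream.runWith(Sink.ignore), 1.minute)
    println(s"$label: ${(System.nanoTime() - start) / 1000000} ms")
  }

  time("fused")(source.via(work))                                             // (1)
  time("buffer")(source.buffer(64, OverflowStrategy.backpressure).via(work))  // (2)
  time("async")(source.async.via(work))                                       // (3)

  system.terminate()
}
```

Running each variant several times, and discarding the first run to allow for JVM warm-up, should give a fairer picture than a single pass.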