Apache Kafka Grouping Twice

Question

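I am writing an application that tries to count the number of users who visit a certain page per hour. I am trying to filter for specific events, group by userId and the event's hour, and then group by hour alone to get the user count. However, grouping the KTable causes excessive CPU burn and locking when I try to close the stream. Is there a better way to do this?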

    events
        .groupBy(...)
        .aggregate(...)
        .groupBy(...)
        .count();

Answer 1

Given the answer to the question above, "I just want to know within an hour time window the number of users that performed a specific action", I would suggest the following.

Assume you have a record like this:

class ActionRecord {
  String actionType;
  String user;
}

You can then define an aggregate class like this:

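import java.util.HashSet;
import java.util.Set;

class ActionRecordAggregate {
  // Distinct users seen so far; a Set ignores repeated users
  private Set<String> users = new HashSet<>();

  // Returns this so the method can be used directly as a Kafka Streams aggregator
  public ActionRecordAggregate add(ActionRecord rec) {
    users.add(rec.user);
    return this;
  }

  public int count() {
    return users.size();
  }
}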
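To show what the aggregate buys you, here is a small self-contained sketch. The class name AggregateDemo, the ActionRecord constructor, and the sample users are assumptions made for the example; in this sketch add() returns the aggregate itself so the calls can be chained.

```java
import java.util.HashSet;
import java.util.Set;

public class AggregateDemo {
  // Same fields as the record above, plus a constructor for convenience (an addition for this sketch).
  static class ActionRecord {
    String actionType;
    String user;

    ActionRecord(String actionType, String user) {
      this.actionType = actionType;
      this.user = user;
    }
  }

  // Same idea as the aggregate above: a set of users, so repeats are ignored.
  static class ActionRecordAggregate {
    private final Set<String> users = new HashSet<>();

    ActionRecordAggregate add(ActionRecord rec) {
      users.add(rec.user);
      return this;
    }

    int count() {
      return users.size();
    }
  }

  public static void main(String[] args) {
    ActionRecordAggregate agg = new ActionRecordAggregate()
        .add(new ActionRecord("pageView", "alice"))
        .add(new ActionRecord("pageView", "bob"))
        .add(new ActionRecord("pageView", "alice")); // repeat visit, not counted again

    System.out.println(agg.count()); // prints 2
  }
}
```

Because the state is a set rather than a counter, replaying the same record does not inflate the count.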

Then your streaming application can:

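Take in the events
Re-key them based on event type (.map())
Group them by event type (.groupByKey())
Window them by time (I chose 1 minute, but YMMV)
Aggregate them into ActionRecordAggregate
Materialize them into a StateStore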
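Before the Kafka Streams DSL version, it may help to see what the re-key, window, and aggregate steps compute. The following is a plain-Java model with no Kafka involved, not how Kafka Streams implements it. The names WindowSketch, Event, windowStart, and aggregate are made up for this sketch, and it assumes non-negative timestamps in milliseconds with windows aligned to the epoch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class WindowSketch {
  // Hypothetical event shape for this sketch: action type, user, event time in ms.
  static class Event {
    final String actionType;
    final String user;
    final long timestampMs;

    Event(String actionType, String user, long timestampMs) {
      this.actionType = actionType;
      this.user = user;
      this.timestampMs = timestampMs;
    }
  }

  // Start of the fixed-size window that contains the timestamp.
  static long windowStart(long timestampMs, long windowSizeMs) {
    return timestampMs - (timestampMs % windowSizeMs);
  }

  // Maps "actionType@windowStart" to the distinct users seen in that window.
  static Map<String, Set<String>> aggregate(List<Event> events, long windowSizeMs) {
    Map<String, Set<String>> usersByKeyAndWindow = new TreeMap<>();
    for (Event e : events) {
      String key = e.actionType + "@" + windowStart(e.timestampMs, windowSizeMs);
      usersByKeyAndWindow.computeIfAbsent(key, k -> new HashSet<>()).add(e.user);
    }
    return usersByKeyAndWindow;
  }

  public static void main(String[] args) {
    long oneMinute = 60 * 1000L;
    List<Event> events = Arrays.asList(
        new Event("pageView", "alice", 5_000L),
        new Event("pageView", "bob", 30_000L),
        new Event("pageView", "alice", 59_999L), // same window as the first event
        new Event("pageView", "alice", 60_000L), // first event of the next window
        new Event("click", "carol", 10_000L));

    aggregate(events, oneMinute)
        .forEach((key, users) -> System.out.println(key + " -> " + users.size()));
    // click@0 -> 1
    // pageView@0 -> 2
    // pageView@60000 -> 1
  }
}
```

Note that alice shows up in two different pageView windows, which matters later when you add up counts across windows.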

In the Kafka Streams DSL, that looks like:

stream()
// Re-key each record by its action type
.map((key, val) -> KeyValue.pair(val.actionType, val))
.groupByKey()
// 1-minute windows; the size is given in milliseconds
.windowedBy(TimeWindows.of(60 * 1000))
.aggregate(
  ActionRecordAggregate::new,
  // The aggregator must return the updated aggregate
  (key, value, agg) -> { agg.add(value); return agg; },
  Materialized
      .<String, ActionRecordAggregate, WindowStore<Bytes, byte[]>>as("actionTypeLookup")
      .withValueSerde(getSerdeForActionRecordAggregate())
);
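The getSerdeForActionRecordAggregate() helper is not shown here; its job is to turn the aggregate into bytes for the state store and back. As a rough sketch of just the byte encoding, assuming the aggregate's only state is its set of user IDs, something like the following could sit inside a custom serializer and deserializer. UserSetCodec, encode, and decode are hypothetical names, and wiring them into Kafka's Serializer and Deserializer interfaces (for example via Serdes.serdeFrom) is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class UserSetCodec {
  // Layout: element count, then each user ID written with writeUTF (length-prefixed).
  static byte[] encode(Set<String> users) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(users.size());
      for (String user : users) {
        out.writeUTF(user);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  // Reads back exactly what encode() wrote.
  static Set<String> decode(byte[] data) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      int size = in.readInt();
      Set<String> users = new HashSet<>();
      for (int i = 0; i < size; i++) {
        users.add(in.readUTF());
      }
      return users;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    Set<String> users = new HashSet<>(Arrays.asList("alice", "bob"));
    Set<String> roundTripped = decode(encode(users));
    System.out.println(roundTripped.equals(users)); // prints true
  }
}
```

A JSON or Avro based serde would work just as well; the point is only that the whole set has to be written on every update, so very large user sets per window will make the store heavy.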

Then, to get the events back, you can query your state store:

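ReadOnlyWindowStore<String, ActionRecordAggregate> store =
  streams.store("actionTypeLookup", QueryableStoreTypes.windowStore());

int totalCount = 0;
// fetch() returns one KeyValue per window: key = window start time, value = aggregate.
// The iterator must be closed, hence try-with-resources.
try (WindowStoreIterator<ActionRecordAggregate> wIt =
       store.fetch("actionTypeToGet", startTimestamp, endTimestamp)) {
  while (wIt.hasNext()) {
    totalCount += wIt.next().value.count();
  }
}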

// totalCount is the sum of the per-window distinct-user counts for action
// type "actionTypeToGet" over your time interval; a user who appears in
// several windows is counted once per window
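
One caveat when the queried interval spans more than one window: adding up per-window counts counts a user once for every window they appear in. If you need distinct users over the whole interval, you would have to union the per-window user sets instead. The sketch below shows the difference in plain Java; it assumes the aggregate exposes its user set (say through a hypothetical getUsers() accessor, which the class above does not have), and the names DistinctAcrossWindows, sumOfWindowCounts, and distinctUsers are made up for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DistinctAcrossWindows {
  // Adds up per-window counts: a user is counted once per window they appear in.
  static int sumOfWindowCounts(List<Set<String>> windows) {
    int total = 0;
    for (Set<String> users : windows) {
      total += users.size();
    }
    return total;
  }

  // Unions the per-window sets: a user is counted once for the whole interval.
  static int distinctUsers(List<Set<String>> windows) {
    Set<String> all = new HashSet<>();
    for (Set<String> users : windows) {
      all.addAll(users);
    }
    return all.size();
  }

  static Set<String> setOf(String... users) {
    return new HashSet<>(Arrays.asList(users));
  }

  public static void main(String[] args) {
    // Three consecutive windows; alice and bob each appear in two of them.
    List<Set<String>> windows = Arrays.asList(
        setOf("alice", "bob"),
        setOf("alice"),
        setOf("carol", "bob"));

    System.out.println(sumOfWindowCounts(windows)); // prints 5
    System.out.println(distinctUsers(windows));     // prints 3
  }
}
```

If the window size matches the interval you care about (for example one-hour windows for an hourly count), each query hits a single window and the two numbers agree.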

Hope this helps!

apache-kafka

apache-kafka-streams