Apache Flink 中的前 n 个查询使用了多少状态?

How much state is used for top-n queries in Apache Flink?

我想知道一般有多少状态 Apache Flink uses for Top-N 查询和 table。

首先,我使用 Flink SQL 处理来自 Kafka 主题的消息:

CREATE TABLE purchases (
  country STRING,
  product STRING
) WITH (
   'connector' = 'kafka',
   'topic' = 'purchases',
   'properties.bootstrap.servers' = 'kafka:29092',
   'value.format' = 'json',
   'properties.group.id' = '1',
   'scan.startup.mode' = 'earliest-offset'
);

我还初始化了一个 JDBC 连接器:

CREATE TABLE aggregations (
  `country` STRING,
  `product` STRING,
  `purchases` BIGINT NOT NULL,
  PRIMARY KEY (`country`, `product`) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:postgresql://postgres:5432/postgres?&user=postgres&password=postgres',
  'table-name' = 'aggregations'
);

终于开始聚合了:

insert into aggregations
SELECT `country`, `product`, `purchases`
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY country ORDER BY `purchases` DESC) AS row_num
  FROM (select country, product, count(*) as `purchases` from purchases group by country, product))
WHERE row_num <= 3;

来自 Flink 状态管理 docs 说:

Conceptually, source tables are never kept entirely in state. An implementer deals with logical tables (i.e. dynamic tables). Their state requirements depend on the used operations.

那么我是否正确理解 Flink 不保存来自 Kafka 连接器的 purchases table 行?

更重要的是,在聚合中:

select country, product, count(*) as `purchases` from purchases group by country, product

Flink 是否保留每个国家/地区的产品密钥状态?

Flink 会将 SQL/Table API 转换为 DataStream/DataSet 运算符。例如。对于SQL中的purchasestable,在DataStream中会被转换成FlinkKafkaConsumer

你是对的。 Flinks 不会将 Flink 的数据保存到状态中,而是将 Kafka 的偏移量保存到状态中。

对于select and group by语句,是的,Flink会将key和values(count)保存在states中。

使用 Flink SQL 或 Table API 时,您的传入流将转换为动态 table。您的 top-n 是一个连续的查询,它会累积状态。这在 https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/table/concepts/dynamic_tables/

中有更详细的解释

您的 Top-N 查询会累积状态,如 https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/table/sql/queries/topn/. There's also the Window Top-N, as explained at https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/table/sql/queries/window-topn/ 中所述。后者提到Moreover, window Top-N purges all intermediate state when no longer needed.。与 Top-N 相比,Window Top-N 对状态更友好。