Aggregation over aggregation

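I have time series data (stock exchange trades) that I need to aggregate by time interval: 1 minute, 5 minutes, 15 minutes, and so on. Higher timeframes can be computed from lower ones, i.e. 5 x 1 minute -> 5 minutes.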

I created a MATERIALIZED VIEW with the AggregatingMergeTree engine and computed the m1 timeframe successfully, with columns like:

maxState(price) as price_high, countState(item_id) as trades_count

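But I don't know how to build the next timeframe. If I use maxMerge in the next view, I get an incorrect result, which is fair enough, because the documentation says I have to use the -State suffix with AggregatingMergeTree. When I use -State in m5, it fails with an error as well.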

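I want to build a chain of materialized views in which each lower-timeframe view feeds updates from the trades down the pipeline to the next higher-timeframe view.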

Update (SQL):

CREATE MATERIALIZED VIEW IF NOT EXISTS candle_m1_state
ENGINE = AggregatingMergeTree() PARTITION BY toYYYYMM(toDateTime(timestamp_close_m1/1000)) 
ORDER BY (platform_id, symbol, timestamp_close_m1)
POPULATE AS
select
 platform_id as platform_id,
 symbol as symbol,
 '1m' as `candle_interval`,
 1000*toUnixTimestamp(toStartOfMinute(toDateTime(timestamp/1000))) as timestamp_m1,
 1000*toUnixTimestamp(addMinutes(toStartOfMinute(toDateTime(timestamp/1000)), 1)) as timestamp_close_m1,
...
 minState(price) as price_low,
 countState(item_id) as trades_count
from trade
group by platform_id, symbol, timestamp_m1, timestamp_close_m1, `candle_interval`
order by timestamp_close_m1;

/* The one below is definitely wrong because of the -State suffix */
CREATE MATERIALIZED VIEW IF NOT EXISTS candle_m5_test
ENGINE = AggregatingMergeTree() PARTITION BY toYYYYMM(toDateTime(timestamp_close_m5 / 1000)) 
ORDER BY (platform_id, symbol, timestamp_close_m5) SETTINGS index_granularity = 8192 
POPULATE AS 
SELECT platform_id, symbol, '5m' AS candle_interval,
 1000 * toUnixTimestamp(toStartOfFiveMinute(toDateTime(timestamp_m1 / 1000))) AS timestamp_m5,
 1000 * toUnixTimestamp(addMinutes(toStartOfFiveMinute(toDateTime(timestamp_m1 / 1000)), 5)) AS timestamp_close_m5, 
 ...
 minState(price_low) AS price_low, 
 countState(trades_count) AS trades_count 
FROM candle_m1_state 
GROUP BY platform_id, symbol, timestamp_m5, timestamp_close_m5 
ORDER BY platform_id ASC, symbol ASC, timestamp_close_m5 ASC;

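I wouldn't try to chain the views. I would create one view per aggregation.

Also keep in mind that a MATERIALIZED VIEW is more of a trigger than a view.

I would recommend: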

CREATE MATERIALIZED VIEW
    stream__source__target_5m TO target_5m
AS
SELECT ...;

CREATE MATERIALIZED VIEW
    stream__source__target_1m TO target_1m
AS
SELECT ...;

and so on,

where target_xm is your target table.
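
To make the `TO` form concrete, here is a minimal sketch for the 1-minute case. The column types are assumptions (the question does not show the schema of `trade`): `price` is taken to be Float64, `item_id` UInt64, and `timestamp` a Unix time in milliseconds. The view name `stream__trade__target_1m` and the column `timestamp_close` are hypothetical too. The `AggregateFunction(...)` argument types must match the real column types, or inserts into the target table will fail.

```sql
-- Hypothetical target table: holds intermediate aggregate states, not final values.
CREATE TABLE IF NOT EXISTS target_1m
(
    platform_id     UInt32,                          -- assumed type
    symbol          String,                          -- assumed type
    timestamp_close DateTime,
    price_high      AggregateFunction(max, Float64),  -- assumes price is Float64
    price_low       AggregateFunction(min, Float64),
    trades_count    AggregateFunction(count, UInt64)  -- assumes item_id is UInt64
)
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(timestamp_close)
ORDER BY (platform_id, symbol, timestamp_close);

-- The view acts as an insert trigger on `trade`: each inserted block is
-- aggregated and the resulting states are written into target_1m.
CREATE MATERIALIZED VIEW IF NOT EXISTS stream__trade__target_1m TO target_1m
AS
SELECT
    platform_id,
    symbol,
    addMinutes(toStartOfMinute(toDateTime(timestamp / 1000)), 1) AS timestamp_close,
    maxState(price)     AS price_high,
    minState(price)     AS price_low,
    countState(item_id) AS trades_count
FROM trade
GROUP BY platform_id, symbol, timestamp_close;
```

Under the same assumptions, a 5-minute variant would only swap `toStartOfMinute` for `toStartOfFiveMinute` and add 5 minutes instead of 1, still reading from `trade`.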

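Because of query time, I wanted to stick with a chain of materialized views rather than creating a view over the raw data for each timeframe (TF) aggregation.

So the solution is: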

Raw data -> TF1 materialized view (AggregatingMergeTree, -State suffix) -> TF2 (from TF1) (AggregatingMergeTree, -MergeState suffix)
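
As a rough sketch of the TF2 step, assuming the `candle_m1_state` view from the question: the 5-minute view combines the stored 1-minute states with -MergeState instead of re-aggregating them with -State. The view name `candle_m5_state` is hypothetical, and the columns elided with `...` in the question are left out here as well. One caveat I have not verified on every version: a view that reads from another view's name may not be triggered on insert, in which case reading from the first view's underlying target table is the safer choice.

```sql
-- Sketch only: candle_m5_state is a hypothetical name.
CREATE MATERIALIZED VIEW IF NOT EXISTS candle_m5_state
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(toDateTime(timestamp_close_m5 / 1000))
ORDER BY (platform_id, symbol, timestamp_close_m5)
AS
SELECT
    platform_id,
    symbol,
    '5m' AS candle_interval,
    1000 * toUnixTimestamp(toStartOfFiveMinute(toDateTime(timestamp_m1 / 1000))) AS timestamp_m5,
    1000 * toUnixTimestamp(addMinutes(toStartOfFiveMinute(toDateTime(timestamp_m1 / 1000)), 5)) AS timestamp_close_m5,
    -- -MergeState takes aggregate states as input and returns a combined state,
    -- so the result can be stored in AggregatingMergeTree again.
    minMergeState(price_low)      AS price_low,
    countMergeState(trades_count) AS trades_count
FROM candle_m1_state
GROUP BY platform_id, symbol, timestamp_m5, timestamp_close_m5;
```

The distinction is in the input type: -State expects raw values, -Merge expects states and returns a final value, and -MergeState expects states and returns a state.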

Then query any of TF1, TF2, ... with the -Merge suffix.
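
For example, reading finished 1-minute candles from the `candle_m1_state` view defined in the question could look like the sketch below. The output aliases `low` and `trades` are my own. The GROUP BY is still needed at query time because rows with the same key may sit in different, not yet merged, parts.

```sql
SELECT
    platform_id,
    symbol,
    timestamp_close_m1,
    minMerge(price_low)      AS low,     -- finalizes the stored min state
    countMerge(trades_count) AS trades   -- finalizes the stored count state
FROM candle_m1_state
GROUP BY platform_id, symbol, timestamp_close_m1
ORDER BY platform_id, symbol, timestamp_close_m1;
```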