Database design for time series

Question

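Approximately every 10 minutes I insert ~50 records that share the same timestamp.
That means about 300 records per hour, 7,200 records per day, or roughly 2.6 million records per year.
Users want to retrieve all records whose timestamp is closest to a requested time.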
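
Working the numbers through as a sanity check: 50 records every 10 minutes comes to 300 per hour. The short Python sketch below spells out the arithmetic; the constant names are illustrative only, and a 365-day year is assumed.

```python
# Volume estimate for ~50 records inserted every 10 minutes.
RECORDS_PER_BATCH = 50
BATCHES_PER_HOUR = 60 // 10      # one batch every 10 minutes

per_hour = RECORDS_PER_BATCH * BATCHES_PER_HOUR
per_day = per_hour * 24
per_year = per_day * 365         # assumption: 365-day year

print(per_hour, per_day, per_year)
```

At this rate the table stays in the low millions of rows per year, which is small enough that a plain B-tree index on the timestamp is a realistic option.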

Design #1 - a single table with an index on the timestamp column:

    CREATE TABLE A (t timestamp, value int);
    CREATE INDEX a_idx ON A (t);

A single INSERT statement creates ~50 records with the same timestamp:

    INSERT INTO A VALUES
      ('2019-01-02 10:00', 5),
      ('2019-01-02 10:00', 12),
      ('2019-01-02 10:00', 7),
       ….

Get all records closest to the requested time
(I use the function greatest(), which is available in PostgreSQL):

    SELECT * FROM A WHERE t =
    (SELECT t FROM A ORDER BY greatest(t - asked_time, asked_time - t) LIMIT 1)

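I think this query is inefficient because it requires a full table scan.
I plan to partition table A by timestamp, with one partition per year, but the approximate match above will still be slow.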

Design #2 - create 2 tables:
1st table: holds the unique timestamps and an auto-incremented PK,
2nd table: holds the data and a foreign key to the 1st table's PK

    CREATE TABLE UNIQ_TIMESTAMP (id SERIAL PRIMARY KEY, t timestamp);
    CREATE TABLE DATA (id INTEGER REFERENCES UNIQ_TIMESTAMP (id), value int);
    CREATE INDEX data_time_idx ON DATA (id);

Get all records closest to the requested time:

    SELECT * FROM DATA WHERE id =
    (SELECT id FROM UNIQ_TIMESTAMP ORDER BY greatest(t - asked_time, asked_time - t) LIMIT 1)

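It should run faster than design #1 because the nested select scans the smaller table.
Disadvantages of this approach:
- I have to insert into 2 tables instead of one
- I lose the ability to partition the DATA table by timestamp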

What would you recommend?

Answer 1

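I would go with the single-table approach, perhaps partitioned by year so that it becomes easier to get rid of old data.

Create an index like this: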

CREATE INDEX ON a (date_trunc('hour', t + INTERVAL '30 minutes'));

Then use your query as you wrote it, but add

AND date_trunc('hour', t + INTERVAL '30 minutes')
  = date_trunc('hour', asked_time + INTERVAL '30 minutes')

The additional condition acts as a filter and can use the index.
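
To see what the indexed expression does, here is a small sketch in Python (not PostgreSQL) that mimics `date_trunc('hour', t + INTERVAL '30 minutes')`. The helper name `hour_bucket` is made up for illustration. It rounds each timestamp to the nearest full hour, so the extra condition narrows the search to rows in the same one-hour bucket as `asked_time`.

```python
from datetime import datetime, timedelta

def hour_bucket(t: datetime) -> datetime:
    """Mimic date_trunc('hour', t + INTERVAL '30 minutes'): round to the nearest hour."""
    shifted = t + timedelta(minutes=30)
    return shifted.replace(minute=0, second=0, microsecond=0)

asked = datetime(2019, 1, 2, 10, 12)
for minute in (0, 10, 20, 30):
    t = datetime(2019, 1, 2, 10, minute)
    # Third column: would this row pass the extra filter for the asked time?
    print(t, hour_bucket(t), hour_bucket(t) == hour_bucket(asked))
```

One consequence worth keeping in mind: a requested time just before a bucket boundary (say 10:29) and a record just after it (10:30) land in different buckets, so the filtered query returns the closest record within the bucket, which is not always the globally closest one.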

Answer 2

You can use a UNION of two queries to find the timestamps closest to a given one, one on each side:

(
  select t
  from a
  where t >= timestamp '2019-03-01 17:00:00'
  order by t
  limit 1
)
union all
(
  select t
  from a
  where t <= timestamp '2019-03-01 17:00:00'
  order by t desc
  limit 1
)

This makes efficient use of an index on t. On a table with 10 million rows (~3 years of data), I get the following execution plan:

Append  (cost=0.57..1.16 rows=2 width=8) (actual time=0.381..0.407 rows=2 loops=1)
  Buffers: shared hit=6 read=4
  I/O Timings: read=0.050
  ->  Limit  (cost=0.57..0.58 rows=1 width=8) (actual time=0.380..0.381 rows=1 loops=1)
        Output: a.t
        Buffers: shared hit=1 read=4
        I/O Timings: read=0.050
        ->  Index Only Scan using a_t_idx on stuff.a  (cost=0.57..253023.35 rows=30699415 width=8) (actual time=0.380..0.380 rows=1 loops=1)
              Output: a.t
              Index Cond: (a.t >= '2019-03-01 17:00:00'::timestamp without time zone)
              Heap Fetches: 0
              Buffers: shared hit=1 read=4
              I/O Timings: read=0.050
  ->  Limit  (cost=0.57..0.58 rows=1 width=8) (actual time=0.024..0.025 rows=1 loops=1)
        Output: a_1.t
        Buffers: shared hit=5
        ->  Index Only Scan Backward using a_t_idx on stuff.a a_1  (cost=0.57..649469.88 rows=78800603 width=8) (actual time=0.024..0.024 rows=1 loops=1)
              Output: a_1.t
              Index Cond: (a_1.t <= '2019-03-01 17:00:00'::timestamp without time zone)
              Heap Fetches: 0
              Buffers: shared hit=5
Planning Time: 1.823 ms
Execution Time: 0.425 ms

As you can see, it requires only a few I/O operations and is pretty much independent of the table size.
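
The reason the cost barely depends on table size is that each half of the union is a single probe into an ordered index. As a rough analogy only (a sorted Python list stands in for the B-tree, and the function name `closest` is hypothetical), the same idea looks like this:

```python
from bisect import bisect_left
from datetime import datetime

def closest(sorted_ts, asked):
    """Return the element of a non-empty sorted list that is nearest to asked."""
    i = bisect_left(sorted_ts, asked)        # first position with ts >= asked
    candidates = []
    if i < len(sorted_ts):
        candidates.append(sorted_ts[i])      # like: where t >= asked order by t limit 1
    if i > 0:
        candidates.append(sorted_ts[i - 1])  # like: nearest earlier t (order by t desc limit 1)
    return min(candidates, key=lambda ts: abs(ts - asked))

timestamps = [
    datetime(2019, 3, 1, 16, 50),
    datetime(2019, 3, 1, 17, 0),
    datetime(2019, 3, 1, 17, 10),
]
nearest = closest(timestamps, datetime(2019, 3, 1, 17, 4))
print(nearest)
```

The binary search touches a logarithmic number of entries, which mirrors the handful of buffer hits in the plan above.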

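The above can be used in an IN condition. Note that this returns the rows for both neighbouring timestamps (the nearest at or after and the nearest at or before the requested time), not only the single closest one: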

select *
from a
where t in ( 
  (select t
   from a
   where t >= timestamp '2019-03-01 17:00:00'
   order by t
   limit 1)
  union all
  (select t
   from a
   where t <= timestamp '2019-03-01 17:00:00'
   order by t desc
   limit 1)
);

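If you know that you will never have more than 100 values close to the requested timestamp, you could remove the IN query entirely and simply use a limit 100 in both parts of the union. That makes the query more efficient because there is no second step to evaluate the IN condition, but it might return more rows than you want.

If you always look for timestamps in the same year, then partitioning by year will indeed help.

If the query is too complicated, you can put it into a function: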

create or replace function get_closest(p_tocheck timestamp)
  returns timestamp
as
$$
  select *
  from (
     (select t
     from a
     where t >= p_tocheck
     order by t
     limit 1)
    union all
    (select t
     from a
     where t <= p_tocheck
     order by t desc
     limit 1)
  ) x
  order by greatest(t - p_tocheck, p_tocheck - t)
  limit 1;
$$
language sql stable;

The query then becomes as simple as:

select *
from a
where t = get_closest(timestamp '2019-03-01 17:00:00');
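
For readers who want to try the pattern without a PostgreSQL instance, here is a minimal end-to-end sketch using Python's built-in `sqlite3` module. It is an assumption-laden stand-in, not the setup used above: SQLite stores the timestamps as ISO text, has no `greatest()` on intervals (so `abs(julianday(...))` is used for the distance), and needs each limited subquery wrapped in its own `SELECT * FROM (...)`. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (t TEXT, value INTEGER)")
conn.execute("CREATE INDEX a_t_idx ON a (t)")
conn.executemany(
    "INSERT INTO a VALUES (?, ?)",
    [
        ("2019-03-01 16:50:00", 1),
        ("2019-03-01 17:00:00", 5),
        ("2019-03-01 17:00:00", 12),
        ("2019-03-01 17:10:00", 7),
    ],
)

# Same shape as get_closest(): one probe on each side, then keep the nearer one.
SQL = """
SELECT t, value FROM a WHERE t = (
  SELECT t FROM (
    SELECT * FROM (SELECT t FROM a WHERE t >= :asked ORDER BY t LIMIT 1)
    UNION ALL
    SELECT * FROM (SELECT t FROM a WHERE t <= :asked ORDER BY t DESC LIMIT 1)
  )
  ORDER BY abs(julianday(t) - julianday(:asked))
  LIMIT 1
)
ORDER BY value
"""
rows = conn.execute(SQL, {"asked": "2019-03-01 17:04:00"}).fetchall()
print(rows)
```

With the requested time at 17:04, the 17:00 batch is four minutes away and the 17:10 batch six, so both 17:00 rows come back.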

Another solution is to use the btree_gist extension, which provides a "distance" operator <->.

Then you can create a GiST index on the timestamp:

create index on a using gist (t) ;

and use the following query:

select *
from a where t in (select t
                  from a
                  order by t <-> timestamp '2019-03-01 17:00:00'
                  limit 1);

postgresql

database-design

relational-database