Efficient way to sum overlapping date intervals with several groupings in PostgreSQL version 8.4

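Hi, I'm new to Stack Overflow and also new to psql, so please be gentle if I'm doing something wrong.

I have a large dataset of animal movements that looks roughly like this: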

animalid | movementdate | offmovementdate | location | rsk
==========================================================
1        | 1998-01-01   | 1998-04-01      | 3        | Y
1        | 1998-04-01   | 1999-04-01      | 1        | Y
1        | 1999-04-01   | 1999-07-01      | 2        | N
2        | 1998-05-01   | 1999-04-01      | 3        | Y
3        | 1998-02-01   | 1999-01-01      | 2        | N
3        | 1999-01-01   | 1999-06-01      | 1        | Y
4        | 1997-12-01   | 1998-05-01      | 1        | Y
4        | 1998-05-01   | 1999-04-01      | 2        | N

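I want to sum, for each animal and stratified by risk, all the days of contact, i.e. days spent at a location shared with another animal. The interval is movementdate to offmovementdate. The variable lcd should be the sum of the days spent at the same location as another individual. So if I spend 3 days at the same location with 2 individuals and 2 days at the same location with 1 individual, my lcd is 3+3+2=8.

So my output should look like this: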

animalid | rsk | lcd
=======================
1        | Y   | 120
1        | N   | 0
2        | Y   | 0
3        | Y   | 90
3        | N   | 245
4        | Y   | 30
4        | N   | 245

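So the value 120 in the first row is obtained by adding the contact days at location 3 (0 days) and at location 1 (1999-01-01 to 1999-04-01, which is 90 days, plus 1998-04-01 to 1998-05-01, which is 30 days).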
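The arithmetic above can be checked with a small script. This is only a sketch in Python rather than SQL, and the names `MOVES`, `overlap_days` and `lcd_by_pairs` are made up for illustration. It compares every pair of stays at the same location, so it is as quadratic as a self-join and is only meant for verifying the expected numbers on the sample rows.

```python
from collections import defaultdict
from datetime import date

# Sample rows from the table above:
# (animalid, movementdate, offmovementdate, location, rsk)
MOVES = [
    (1, date(1998, 1, 1), date(1998, 4, 1), 3, "Y"),
    (1, date(1998, 4, 1), date(1999, 4, 1), 1, "Y"),
    (1, date(1999, 4, 1), date(1999, 7, 1), 2, "N"),
    (2, date(1998, 5, 1), date(1999, 4, 1), 3, "Y"),
    (3, date(1998, 2, 1), date(1999, 1, 1), 2, "N"),
    (3, date(1999, 1, 1), date(1999, 6, 1), 1, "Y"),
    (4, date(1997, 12, 1), date(1998, 5, 1), 1, "Y"),
    (4, date(1998, 5, 1), date(1999, 4, 1), 2, "N"),
]


def overlap_days(start_a, end_a, start_b, end_b):
    """Days shared by two stays; 0 if they do not overlap."""
    return max(0, (min(end_a, end_b) - max(start_a, start_b)).days)


def lcd_by_pairs(moves):
    """Sum overlap days per (animalid, rsk) over all pairs of stays at one location."""
    totals = defaultdict(int)
    for a_id, a_on, a_off, a_loc, a_rsk in moves:
        totals[(a_id, a_rsk)] += 0  # keep animals without any contact at 0
        for b_id, b_on, b_off, b_loc, _ in moves:
            if a_loc == b_loc and a_id != b_id:
                totals[(a_id, a_rsk)] += overlap_days(a_on, a_off, b_on, b_off)
    return dict(totals)


for (animalid, rsk), lcd in sorted(lcd_by_pairs(MOVES).items()):
    print(animalid, rsk, lcd)
```

For animal 1 with rsk Y this adds 0 days at location 3, 90 days shared with animal 3 and 30 days shared with animal 4 at location 1, giving 120.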

I tried the following queries:

CREATE TABLE tmpcpy AS
SELECT ta.animalid,ta.location,ta.rsk,
   SUM(AGE(LEAST(ta.offmovementdate,tb.offmovementdate),
   GREATEST(ta.movementdate,tb.movementdate))) ctc_ds 
FROM tmpd ta, tmpd tb 
WHERE ta.location=tb.location 
  AND ta.animalid IS DISTINCT FROM tb.animalid
  AND LEAST(ta.offmovementdate,tb.offmovementdate) > 
      GREATEST(ta.movementdate,tb.movementdate)
GROUP BY ta.animalid, ta.rsk, ta.location;

CREATE TABLE lcd_out AS 
SELECT animalid, rsk, SUM(ctc_ds) lcd 
FROM tmpcpy 
GROUP BY animalid, rsk;

But I get the following error message:

ERROR:  could not write block 24905954 of temporary file: No space left on device

Is there a more efficient way to get the desired output?

The EXPLAIN output of the first query on my real dataset is as follows:

GroupAggregate  (cost=677015920636.46..691909507980.53 rows=3804913 width=42)
  ->  Sort  (cost=677015920636.46..679994626690.54 rows=1191482421630 width=42)
        Sort Key: ta.animalid, ta.rsk, ta.location
        ->  Merge Join  (cost=18773271.33..71508531671.51 rows=1191482421630 width=42)
              Merge Cond: (ta.location = tb.location)
              Join Filter: ((ta.animalid IS DISTINCT FROM tb.animalid) AND (LEAST(ta.offmovementdate, tb.offmovementdate) > GREATEST(ta.movementdate, tb.movementdate)))
              ->  Sort  (cost=9646734.67..9741857.48 rows=38049124 width=26)
                    Sort Key: ta.location
                    ->  Seq Scan on moves ta  (cost=0.00..1214663.24 rows=38049124 width=26)
              ->  Materialize  (cost=9126536.67..9602150.72 rows=38049124 width=24)
                    ->  Sort  (cost=9126536.67..9221659.48 rows=38049124 width=24)
                          Sort Key: tb.location
                          ->  Seq Scan on moves tb  (cost=0.00..1214663.24 rows=38049124 width=24)
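As a rough check on why the temporary file fills the disk, the planner's own estimates from the plan above can be multiplied out. This is only back-of-the-envelope arithmetic, assuming the top-level sort has to spill roughly rows times average width in bytes. The real figure depends on the actual row count and per-tuple overhead, and the function name is made up for illustration.

```python
# Planner estimates copied from the EXPLAIN output above.
ESTIMATED_JOIN_ROWS = 1191482421630  # rows the Merge Join feeds into the top-level Sort
ESTIMATED_ROW_WIDTH = 42             # estimated average row width in bytes


def estimated_sort_bytes(rows, width):
    """Rough size of the sort input: rows times average width, ignoring overhead."""
    return rows * width


size = estimated_sort_bytes(ESTIMATED_JOIN_ROWS, ESTIMATED_ROW_WIDTH)
print(f"{size / 1e12:.1f} TB")  # prints "50.0 TB"
```

If the estimate is anywhere near right, the sort would need temporary space on the order of tens of terabytes, which is consistent with the "No space left on device" error.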

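Since I don't know how you want to handle the records where OFFMOVEMENTDATE and LOCATION are null, I can give you this query, which simply ignores those rows. It should be somewhat more efficient because it does not perform the expensive self-join: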

with act_data (animalid, movementdate, offmovementdate, move, location, death, rsk) as (
        values(1, date'1998-01-01', date'1998-04-01', 1, 3, 'f', 'Y')
        union all
        values(1, date'1998-04-01', date'1999-04-01', 2, 1, 'f', 'Y')
        union all
        values(1, date'1999-04-01', date'1999-07-01', 3, 2, 'f', 'N')
        union all
        values(1, date'1999-07-01', cast(null as date), 4, cast(null as integer), 't', 'N')
        union all
        values(2, date'1998-05-01', date'1999-04-01', 1, 3, 'f', 'Y')
        union all
        values(2, date'1999-04-01', cast(null as date), 2, cast(null as integer), 't', 'N')
        union all
        values(3, date'1998-02-01', date'1999-01-01', 1, 2, 'f', 'N')
        union all
        values(3, date'1999-01-01', date'1999-06-01', 2, 1, 'f', 'Y')
        union all
        values(3, date'1999-06-01', cast(null as date), 3, cast(null as integer), 't', 'N')
        union all
        values(4, date'1997-12-01', date'1998-05-01', 1, 1, 'f', 'Y')
        union all
        values(4, date'1998-05-01', date'1999-04-01', 2, 2, 'f', 'N')
        union all
        values(4, date'1999-04-01', cast(null as date), 3, cast(null as integer), 't', 'N')
    ), my_data as (
        select row_number() over() as id,t.*
        from act_data t
    ), dates as (
        select movementdate as day
        from my_data
        union
        select offmovementdate
        from my_data
    ), my_intervals as (
        select day as start_int, lead(day) over(order by day nulls last) as end_int
        from dates
        where day is not null
        order by day nulls last
    ), intervals as (
        select row_number() over(order by start_int nulls last) as interval_id, start_int, end_int, end_int - start_int as duration
        from my_intervals
    ), overlapping_intervals as (   
        select rsk, location, interval_id, start_int, end_int, duration, array_agg(animalid) as animals
        from intervals i
            join my_data d on (i.start_int>=d.movementdate and i.end_int<=d.offmovementdate)
        group by rsk, location, interval_id, start_int, end_int, duration
        having count(*) > 1
    )
select a as animalid, i.rsk, sum(i.duration) as lcd
from overlapping_intervals i
    cross join unnest(animals) a
group by a, i.rsk
order by animalid, i.rsk

It returns your expected output (apart from the rows where lcd is 0):

 animalid | rsk | lcd
----------+-----+-----
        1 | Y   | 120
        3 | N   | 245
        3 | Y   |  90
        4 | N   | 245
        4 | Y   |  30
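The idea behind the query can be sketched outside SQL as well: every start or end date is a boundary, consecutive boundaries form elementary intervals, and within one such interval the set of animals present at a location does not change. The Python below is an illustrative sketch under a few assumptions, not a translation of the query. The names are hypothetical, boundaries are collected per location, each interval is weighted by the number of other animals present (count minus one), and all animals at a location count as contacts regardless of rsk, whereas the query groups by rsk first. On the sample data rsk does not vary within a location, so the results agree.

```python
from collections import defaultdict
from datetime import date

# (animalid, movementdate, offmovementdate, location, rsk); None marks a missing value
MOVES = [
    (1, date(1998, 1, 1), date(1998, 4, 1), 3, "Y"),
    (1, date(1998, 4, 1), date(1999, 4, 1), 1, "Y"),
    (1, date(1999, 4, 1), date(1999, 7, 1), 2, "N"),
    (1, date(1999, 7, 1), None, None, "N"),
    (2, date(1998, 5, 1), date(1999, 4, 1), 3, "Y"),
    (3, date(1998, 2, 1), date(1999, 1, 1), 2, "N"),
    (3, date(1999, 1, 1), date(1999, 6, 1), 1, "Y"),
    (4, date(1997, 12, 1), date(1998, 5, 1), 1, "Y"),
    (4, date(1998, 5, 1), date(1999, 4, 1), 2, "N"),
]


def lcd_by_intervals(moves):
    """Contact days per (animalid, rsk), computed from elementary intervals."""
    by_location = defaultdict(list)
    for animalid, on, off, location, rsk in moves:
        if off is None or location is None:
            continue  # rows without an end date or location are ignored
        by_location[location].append((animalid, on, off, rsk))

    totals = defaultdict(int)
    for stays in by_location.values():
        # All distinct start and end dates at this location, in order.
        boundaries = sorted({d for _, on, off, _ in stays for d in (on, off)})
        for start, end in zip(boundaries, boundaries[1:]):
            # Animals whose stay covers the whole elementary interval.
            present = [(a, rsk) for a, on, off, rsk in stays
                       if on <= start and end <= off]
            if len(present) > 1:
                days = (end - start).days
                for animalid, rsk in present:
                    # One contact day per day and per other animal present.
                    totals[(animalid, rsk)] += days * (len(present) - 1)
    return dict(totals)


for (animalid, rsk), lcd in sorted(lcd_by_intervals(MOVES).items()):
    print(animalid, rsk, lcd)
```

Because each location is scanned once per elementary interval instead of once per pair of stays, no row-by-row self-join is materialized. Animals without any contact do not appear in the result, just as in the output above.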

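Update

To run the same extraction on 8.4 without a cross join on the array column, you can use the following script. Replace the references to my_data with the name of your main table, and if you already have a locations table in your environment, use it instead of the derived one. The script re-runs the same query once per location, filling a temporary table in several steps. You could also commit at the end of each loop iteration to check whether the execution time is acceptable. Note that the DO statement itself was only added in PostgreSQL 9.0, so on 8.4 the loop would have to be wrapped in a plpgsql function instead.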

create table my_data (animalid, movementdate, offmovementdate, move, location, death, rsk) as (
    values(1, date'1998-01-01', date'1998-04-01', 1, 3, 'f', 'Y')
    union all
    values(1, date'1998-04-01', date'1999-04-01', 2, 1, 'f', 'Y')
    union all
    values(1, date'1999-04-01', date'1999-07-01', 3, 2, 'f', 'N')
    union all
    values(1, date'1999-07-01', cast(null as date), 4, cast(null as integer), 't', 'N')
    union all
    values(2, date'1998-05-01', date'1999-04-01', 1, 3, 'f', 'Y')
    union all
    values(2, date'1999-04-01', cast(null as date), 2, cast(null as integer), 't', 'N')
    union all
    values(3, date'1998-02-01', date'1999-01-01', 1, 2, 'f', 'N')
    union all
    values(3, date'1999-01-01', date'1999-06-01', 2, 1, 'f', 'Y')
    union all
    values(3, date'1999-06-01', cast(null as date), 3, cast(null as integer), 't', 'N')
    union all
    values(4, date'1997-12-01', date'1998-05-01', 1, 1, 'f', 'Y')
    union all
    values(4, date'1998-05-01', date'1999-04-01', 2, 2, 'f', 'N')
    union all
    values(4, date'1999-04-01', cast(null as date), 3, cast(null as integer), 't', 'N')
);

create table locations as (
    select distinct location
    from my_data
    where location is not null
);

create local temp table tmp_result_table (
    animalid bigint,
    location bigint,
    rsk text,
    lcd bigint
) ON COMMIT preserve ROWS;


DO $$DECLARE r record;
BEGIN
    FOR r IN SELECT location FROM locations
    LOOP

    insert into tmp_result_table(animalid, rsk , lcd)
        with dates as (
                select movementdate as day
                from my_data d
                where d.location = r.location
                union
                select offmovementdate
                from my_data d
                where d.location = r.location
            ), intervals as (
                select start_int, end_int, end_int - start_int as duration
                from (
                        select day as start_int, lead(day) over(order by day nulls last) as end_int
                        from dates
                        where day is not null
                    ) a
            ), overlapping_intervals as (   
                select rsk, start_int, end_int, duration, array_agg(animalid) as animals,
                    count(*)-1 as factor
                from intervals i
                    join my_data d on (i.start_int>=d.movementdate and i.end_int<=d.offmovementdate)
                where d.location = r.location
                group by rsk, start_int, end_int, duration
                having count(*) > 1
            )
        select unnest(animals), rsk, lcd
        from (
                select rsk, animals, sum(duration*factor) as lcd
                from overlapping_intervals
                group by rsk, animals
            ) a;
    END LOOP;
    RETURN;
END;$$;



select animalid, rsk, sum(lcd) as lcd
from tmp_result_table
group by animalid, rsk
order by animalid, rsk desc;