Efficient way to sum overlapping date intervals with several groupings in PostgreSQL 8.4
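Hi, I'm new to Stack Overflow and also new to psql, so please go easy on me if I do something wrong.
I have a large dataset of animal movements that looks something like this: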
animalid | movementdate | offmovementdate | location | rsk
==========================================================
1 | 1998-01-01 | 1998-04-01 | 3 | Y
1 | 1998-04-01 | 1999-04-01 | 1 | Y
1 | 1999-04-01 | 1999-07-01 | 2 | N
2 | 1998-05-01 | 1999-04-01 | 3 | Y
3 | 1998-02-01 | 1999-01-01 | 2 | N
3 | 1999-01-01 | 1999-06-01 | 1 | Y
4 | 1997-12-01 | 1998-05-01 | 1 | Y
4 | 1998-05-01 | 1999-04-01 | 2 | N
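I want to sum all contact days for each animal (days spent at a location shared with another animal), stratified by risk. The interval runs from movementdate to offmovementdate. The variable lcd should be the total number of days spent at the same location as another individual, counted once per other individual. So if I spent 3 days in the same place as 2 others and 2 days in the same place as 1 other, my lcd is 3+3+2=8.
So my output should look like this: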
animalid | rsk | lcd
=======================
1 | Y | 120
1 | N | 0
2 | Y | 0
3 | Y | 90
3 | N | 245
4 | Y | 30
4 | N | 245
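So the value 120 in the first row is obtained by adding the contact days at location 3 (0 days) and at location 1 (1999-01-01 to 1999-04-01 plus 1998-04-01 to 1998-05-01, i.e. 90 + 30 days).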
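The pairwise definition above can be checked against the eight sample rows with a short brute-force sketch. Python is used here purely for illustration, the helper names `overlap_days` and `contact_days` are hypothetical, and stays are assumed to be half-open intervals (an animal leaving on the day another arrives shares 0 days with it).

```python
from collections import defaultdict
from datetime import date

# Sample rows: (animalid, movementdate, offmovementdate, location, rsk)
MOVES = [
    (1, date(1998, 1, 1), date(1998, 4, 1), 3, "Y"),
    (1, date(1998, 4, 1), date(1999, 4, 1), 1, "Y"),
    (1, date(1999, 4, 1), date(1999, 7, 1), 2, "N"),
    (2, date(1998, 5, 1), date(1999, 4, 1), 3, "Y"),
    (3, date(1998, 2, 1), date(1999, 1, 1), 2, "N"),
    (3, date(1999, 1, 1), date(1999, 6, 1), 1, "Y"),
    (4, date(1997, 12, 1), date(1998, 5, 1), 1, "Y"),
    (4, date(1998, 5, 1), date(1999, 4, 1), 2, "N"),
]


def overlap_days(start_a, end_a, start_b, end_b):
    """Days shared by two half-open stays; 0 if they do not overlap."""
    return max(0, (min(end_a, end_b) - max(start_a, start_b)).days)


def contact_days(moves):
    """Sum, per (animalid, rsk), the overlap with every other animal at the same location."""
    lcd = defaultdict(int)
    for a_id, a_on, a_off, a_loc, a_rsk in moves:
        lcd[(a_id, a_rsk)] += 0  # keep groups that have no contact at all
        for b_id, b_on, b_off, b_loc, _ in moves:
            if a_id != b_id and a_loc == b_loc:
                lcd[(a_id, a_rsk)] += overlap_days(a_on, a_off, b_on, b_off)
    return dict(lcd)


result = contact_days(MOVES)
for key in sorted(result):
    print(key, result[key])
```

This compares every row with every other row, so it only serves to confirm the expected numbers on a tiny sample; it is not a way to process the full dataset.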
I tried the following queries:
CREATE TABLE tmpcpy AS
SELECT ta.animalid,ta.location,ta.rsk,
SUM(AGE(LEAST(ta.offmovementdate,tb.offmovementdate),
GREATEST(ta.movementdate,tb.movementdate))) ctc_ds
FROM tmpd ta, tmpd tb
WHERE ta.location=tb.location
AND ta.animalid IS DISTINCT FROM tb.animalid
AND LEAST(ta.offmovementdate,tb.offmovementdate) >
GREATEST(ta.movementdate,tb.movementdate)
GROUP BY ta.animalid, ta.rsk, ta.location;
CREATE TABLE lcd_out AS
SELECT animalid, rsk, SUM(ctc_ds) lcd
FROM tmpcpy
GROUP BY animalid, rsk;
But I get the following error message:
ERROR: could not write block 24905954 of temporary file: No space left on device
Is there a more efficient way to get the desired output?
The EXPLAIN output for the first query, run against my real dataset, is as follows:
GroupAggregate (cost=677015920636.46..691909507980.53 rows=3804913 width=42)
-> Sort (cost=677015920636.46..679994626690.54 rows=1191482421630 width=42)
Sort Key: ta.animalid, ta.rsk, ta.location
-> Merge Join (cost=18773271.33..71508531671.51 rows=1191482421630 width=42)
Merge Cond: (ta.location = tb.location)
Join Filter: ((ta.animalid IS DISTINCT FROM tb.animalid) AND (LEAST(ta.offmovementdate, tb.offmovementdate) > GREATEST(ta.movementdate, tb.movementdate)))
-> Sort (cost=9646734.67..9741857.48 rows=38049124 width=26)
Sort Key: ta.location
-> Seq Scan on moves ta (cost=0.00..1214663.24 rows=38049124 width=26)
-> Materialize (cost=9126536.67..9602150.72 rows=38049124 width=24)
-> Sort (cost=9126536.67..9221659.48 rows=38049124 width=24)
Sort Key: tb.location
-> Seq Scan on moves tb (cost=0.00..1214663.24 rows=38049124 width=24)
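A rough reading of the plan suggests why the disk fills up. The sketch below only does arithmetic on the planner's own estimates from the EXPLAIN output; these are estimates rather than measurements, and the variable names are made up for illustration.

```python
# Planner estimates copied from the EXPLAIN output (estimates, not measured values).
input_rows = 38_049_124          # rows scanned on each side of the self-join
joined_rows = 1_191_482_421_630  # rows the Merge Join is expected to emit
row_width = 42                   # estimated bytes per joined row

# Average number of join partners the planner expects for each input row.
matches_per_row = joined_rows / input_rows

# Approximate volume the top-level Sort would have to handle before grouping.
sort_bytes = joined_rows * row_width

print(f"estimated matches per input row: {matches_per_row:,.0f}")
print(f"estimated data fed to the top-level Sort: {sort_bytes / 1e12:.1f} TB")
```

If the estimates are anywhere near right, each row joins with tens of thousands of others and the sort before the GroupAggregate has to spill on the order of tens of terabytes to temporary files, which would be consistent with the "No space left on device" error.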
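Without knowing how you want to handle the records where OFFMOVEMENTDATE and LOCATION are null, I can offer this query, which simply ignores those rows (it should be somewhat more efficient, since it does not perform the expensive self-join):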
with act_data (animalid, movementdate, offmovementdate, move, location, death, rsk) as (
values(1, date'1998-01-01', date'1998-04-01', 1, 3, 'f', 'Y')
union all
values(1, date'1998-04-01', date'1999-04-01', 2, 1, 'f', 'Y')
union all
values(1, date'1999-04-01', date'1999-07-01', 3, 2, 'f', 'N')
union all
values(1, date'1999-07-01', cast(null as date), 4, cast(null as integer), 't', 'N')
union all
values(2, date'1998-05-01', date'1999-04-01', 1, 3, 'f', 'Y')
union all
values(2, date'1999-04-01', cast(null as date), 2, cast(null as integer), 't', 'N')
union all
values(3, date'1998-02-01', date'1999-01-01', 1, 2, 'f', 'N')
union all
values(3, date'1999-01-01', date'1999-06-01', 2, 1, 'f', 'Y')
union all
values(3, date'1999-06-01', cast(null as date), 3, cast(null as integer), 't', 'N')
union all
values(4, date'1997-12-01', date'1998-05-01', 1, 1, 'f', 'Y')
union all
values(4, date'1998-05-01', date'1999-04-01', 2, 2, 'f', 'N')
union all
values(4, date'1999-04-01', cast(null as date), 3, cast(null as integer), 't', 'N')
), my_data as (
select row_number() over() as id,t.*
from act_data t
), dates as (
select movementdate as day
from my_data
union
select offmovementdate
from my_data
), my_intevals as (
select day as start_int, lead(day) over(order by day nulls last) as end_int
from dates
where day is not null
order by day nulls last
), intervals as (
select row_number() over(order by start_int nulls last) as interval_id, start_int, end_int, end_int - start_int as duration
from my_intevals
), overlapping_intervals as (
select rsk, location, interval_id, start_int, end_int, duration, array_agg(animalid) as animals
from intervals i
join my_data d on (i.start_int>=d.movementdate and i.end_int<=d.offmovementdate)
group by rsk, location, interval_id, start_int, end_int, duration
having count(*) > 1
)
select a as animalid, i.rsk, sum(i.duration) as lcd
from overlapping_intervals i
cross join unnest(animals) a
group by a, i.rsk
order by animalid, i.rsk
It returns your expected output, except that groups with an lcd of 0 are not listed:
animalid | rsk | lcd
----------+-----+-----
1 | Y | 120
3 | N | 245
3 | Y | 90
4 | N | 245
4 | Y | 30
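The idea behind the query, cutting the timeline at every movement date and counting who is present in each slice, can be sketched outside SQL as follows. This is an illustrative Python version rather than a translation of the query: the function name is hypothetical, cut points are computed per location, and each animal is assumed to have at most one stay covering any given slice at a location. Each slice is weighted by the number of other animals present, which matches the 3+3+2=8 definition in the question; on the sample data no more than two animals ever share a slice, so the weight is always 1.

```python
from collections import defaultdict
from datetime import date

# Sample rows: (animalid, movementdate, offmovementdate, location, rsk)
MOVES = [
    (1, date(1998, 1, 1), date(1998, 4, 1), 3, "Y"),
    (1, date(1998, 4, 1), date(1999, 4, 1), 1, "Y"),
    (1, date(1999, 4, 1), date(1999, 7, 1), 2, "N"),
    (2, date(1998, 5, 1), date(1999, 4, 1), 3, "Y"),
    (3, date(1998, 2, 1), date(1999, 1, 1), 2, "N"),
    (3, date(1999, 1, 1), date(1999, 6, 1), 1, "Y"),
    (4, date(1997, 12, 1), date(1998, 5, 1), 1, "Y"),
    (4, date(1998, 5, 1), date(1999, 4, 1), 2, "N"),
]


def contact_days_by_slices(moves):
    """Per (animalid, rsk), sum slice length times the number of other animals present."""
    by_location = defaultdict(list)
    for row in moves:
        by_location[row[3]].append(row)

    lcd = defaultdict(int)
    for stays in by_location.values():
        # Every arrival or departure at this location is a cut point.
        cuts = sorted({d for _, on, off, _, _ in stays for d in (on, off)})
        for start, end in zip(cuts, cuts[1:]):
            # Stays that cover the whole slice [start, end).
            present = [(aid, rsk) for aid, on, off, _, rsk in stays
                       if on <= start and end <= off]
            if len(present) > 1:
                for key in present:
                    lcd[key] += (end - start).days * (len(present) - 1)
    return dict(lcd)


result = contact_days_by_slices(MOVES)
for key in sorted(result):
    print(key, result[key])
```

Because a location with k stays produces at most 2k cut points, the work grows with the number of slices and the animals present in them rather than with every pair of rows.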
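Update
To run the same extraction on 8.4 without the cross join on the array column, you can use the following script. Replace the references to my_data with the name of your main table, and if your environment already has a locations table, use it instead of the derived one. The script re-runs the same query once per location, filling a temporary table in several steps. You could also commit at the end of each loop iteration to check whether the execution time is acceptable.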
create table my_data (animalid, movementdate, offmovementdate, move, location, death, rsk) as (
values(1, date'1998-01-01', date'1998-04-01', 1, 3, 'f', 'Y')
union all
values(1, date'1998-04-01', date'1999-04-01', 2, 1, 'f', 'Y')
union all
values(1, date'1999-04-01', date'1999-07-01', 3, 2, 'f', 'N')
union all
values(1, date'1999-07-01', cast(null as date), 4, cast(null as integer), 't', 'N')
union all
values(2, date'1998-05-01', date'1999-04-01', 1, 3, 'f', 'Y')
union all
values(2, date'1999-04-01', cast(null as date), 2, cast(null as integer), 't', 'N')
union all
values(3, date'1998-02-01', date'1999-01-01', 1, 2, 'f', 'N')
union all
values(3, date'1999-01-01', date'1999-06-01', 2, 1, 'f', 'Y')
union all
values(3, date'1999-06-01', cast(null as date), 3, cast(null as integer), 't', 'N')
union all
values(4, date'1997-12-01', date'1998-05-01', 1, 1, 'f', 'Y')
union all
values(4, date'1998-05-01', date'1999-04-01', 2, 2, 'f', 'N')
union all
values(4, date'1999-04-01', cast(null as date), 3, cast(null as integer), 't', 'N')
);
create table locations as (
select distinct location
from my_data
where location is not null
);
create local temp table tmp_result_table (
animalid bigint,
location bigint,
rsk text,
lcd bigint
) ON COMMIT preserve ROWS;
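-- Note: the DO statement requires PostgreSQL 9.0 or later; on 8.4 the same body can be placed in a plpgsql function and called once.
DO $$DECLARE r record;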
BEGIN
FOR r IN SELECT location FROM locations
LOOP
insert into tmp_result_table(animalid, rsk , lcd)
with dates as (
select movementdate as day
from my_data d
where d.location = r.location
union
select offmovementdate
from my_data d
where d.location = r.location
), intervals as (
select start_int, end_int, end_int - start_int as duration
from (
select day as start_int, lead(day) over(order by day nulls last) as end_int
from dates
where day is not null
) a
), overlapping_intervals as (
select rsk, start_int, end_int, duration, array_agg(animalid) as animals,
count(*)-1 as factor
from intervals i
join my_data d on (i.start_int>=d.movementdate and i.end_int<=d.offmovementdate)
where d.location = r.location
group by rsk, start_int, end_int, duration
having count(*) > 1
)
select unnest(animals), rsk, lcd
from (
select rsk, animals, sum(duration*factor) as lcd
from overlapping_intervals
group by rsk, animals
) a;
END LOOP;
RETURN;
END;$$;
select animalid, rsk, sum(lcd) as lcd
from tmp_result_table
group by animalid, rsk
order by animalid, rsk desc;