如何使用 BigQuery 中每个用户的线性插值来填充不规则缺失的时间序列值?
How to fill irregularly missing time-series values with linear interepolation by each user in BigQuery?
我有数据缺少 时间序列 值 不规则 每个用户,我想使用 BigQuery Standard SQL.
+------+---------------------+-------+
| name | time | value |
+------+---------------------+-------+
| Jane | 2020-11-14 09:01:00 | 3 |
| Jane | 2020-11-14 09:05:00 | 5 |
| Jane | 2020-11-14 09:07:00 | 1 |
| Jane | 2020-11-14 09:09:00 | 8 |
| Jane | 2020-11-14 09:10:00 | 4 |
| Kay | 2020-11-14 09:01:00 | 7 |
| Kay | 2020-11-14 09:04:00 | 1 |
| Kay | 2020-11-14 09:05:00 | 10 |
| Kay | 2020-11-14 09:09:00 | 6 |
| Kay | 2020-11-14 09:10:00 | 7 |
+------+---------------------+-------+
我想按如下方式转换它:
+------+---------------------+-------+-----------------+
| name | time | value | |
+------+---------------------+-------+-----------------+
| Jane | 2020-11-14 09:01:00 | 3 | |
| Jane | 2020-11-14 09:02:00 | 3.5 | <= interpolaetd |
| Jane | 2020-11-14 09:03:00 | 4 | <= interpolaetd |
| Jane | 2020-11-14 09:04:00 | 4.5 | <= interpolaetd |
| Jane | 2020-11-14 09:05:00 | 5 | |
| Jane | 2020-11-14 09:06:00 | 3 | <= interpolaetd |
| Jane | 2020-11-14 09:07:00 | 1 | |
| Jane | 2020-11-14 09:08:00 | 4.5 | <= interpolaetd |
| Jane | 2020-11-14 09:09:00 | 8 | |
| Jane | 2020-11-14 09:10:00 | 4 | |
| Kay | 2020-11-14 09:01:00 | 7 | |
| Kay | 2020-11-14 09:02:00 | 5 | <= interpolaetd |
| Kay | 2020-11-14 09:03:00 | 3 | <= interpolaetd |
| Kay | 2020-11-14 09:04:00 | 1 | |
| Kay | 2020-11-14 09:05:00 | 10 | |
| Kay | 2020-11-14 09:06:00 | 9 | <= interpolaetd |
| Kay | 2020-11-14 09:07:00 | 8 | <= interpolaetd |
| Kay | 2020-11-14 09:08:00 | 7 | <= interpolaetd |
| Kay | 2020-11-14 09:09:00 | 6 | |
| Kay | 2020-11-14 09:10:00 | 7 | |
+------+---------------------+-------+-----------------+
我可以问你一些聪明的解决方案吗?
补充:这是的应用题。它非常相似但不同之处在于此数据是 time seris 数据并且它具有 每个用户的名称.
谢谢。
这与您之前的问题没有太大区别。从接受的答案开始,您可以:
select name, time,
ifnull(value, start_value + (end_value - start_value) / (end_tick - start_tick) * (time - start_tick)) as value_interpolated
from (
select name, time, value,
first_value(tick ignore nulls ) over win1 as start_tick,
first_value(value ignore nulls) over win1 as start_value,
first_value(tick ignore nulls ) over win2 as end_tick,
first_value(value ignore nulls) over win2 as end_value,
from (
select name, time, t.time as tick, value
from (
select name, generate_array(min(time), max(time)) times
from `project.dataset.table`
group by name
)
cross join unnest(times) time
left join `project.dataset.table` t using(name, time)
)
window
win1 as (partition by name order by time desc rows between current row and unbounded following),
win2 as (partition by name order by time rows between current row and unbounded following)
)
以下是 BigQuery SQL
#standardSQL
select name, time,
ifnull(value, start_value
+ (end_value - start_value) / timestamp_diff(end_tick, start_tick, minute) * timestamp_diff(time, start_tick, minute)
) as value_interpolated
from (
select name, time, value,
first_value(tick ignore nulls ) over win1 as start_tick,
first_value(value ignore nulls) over win1 as start_value,
first_value(tick ignore nulls ) over win2 as end_tick,
first_value(value ignore nulls) over win2 as end_value,
from (
select name, time, t.time as tick, value
from (
select name, generate_timestamp_array(min(time), max(time), interval 1 minute) times
from `project.dataset.table`
group by name
)
cross join unnest(times) time
left join `project.dataset.table` t
using(name, time)
)
window
win1 as (partition by name order by time desc rows between current row and unbounded following),
win2 as (partition by name order by time rows between current row and unbounded following)
)
如果应用于您问题中的示例数据 - 输出为
我有数据缺少 时间序列 值 不规则 每个用户,我想使用 BigQuery Standard SQL.
+------+---------------------+-------+
| name | time | value |
+------+---------------------+-------+
| Jane | 2020-11-14 09:01:00 | 3 |
| Jane | 2020-11-14 09:05:00 | 5 |
| Jane | 2020-11-14 09:07:00 | 1 |
| Jane | 2020-11-14 09:09:00 | 8 |
| Jane | 2020-11-14 09:10:00 | 4 |
| Kay | 2020-11-14 09:01:00 | 7 |
| Kay | 2020-11-14 09:04:00 | 1 |
| Kay | 2020-11-14 09:05:00 | 10 |
| Kay | 2020-11-14 09:09:00 | 6 |
| Kay | 2020-11-14 09:10:00 | 7 |
+------+---------------------+-------+
我想按如下方式转换它:
+------+---------------------+-------+-----------------+
| name | time | value | |
+------+---------------------+-------+-----------------+
| Jane | 2020-11-14 09:01:00 | 3 | |
| Jane | 2020-11-14 09:02:00 | 3.5 | <= interpolaetd |
| Jane | 2020-11-14 09:03:00 | 4 | <= interpolaetd |
| Jane | 2020-11-14 09:04:00 | 4.5 | <= interpolaetd |
| Jane | 2020-11-14 09:05:00 | 5 | |
| Jane | 2020-11-14 09:06:00 | 3 | <= interpolaetd |
| Jane | 2020-11-14 09:07:00 | 1 | |
| Jane | 2020-11-14 09:08:00 | 4.5 | <= interpolaetd |
| Jane | 2020-11-14 09:09:00 | 8 | |
| Jane | 2020-11-14 09:10:00 | 4 | |
| Kay | 2020-11-14 09:01:00 | 7 | |
| Kay | 2020-11-14 09:02:00 | 5 | <= interpolaetd |
| Kay | 2020-11-14 09:03:00 | 3 | <= interpolaetd |
| Kay | 2020-11-14 09:04:00 | 1 | |
| Kay | 2020-11-14 09:05:00 | 10 | |
| Kay | 2020-11-14 09:06:00 | 9 | <= interpolaetd |
| Kay | 2020-11-14 09:07:00 | 8 | <= interpolaetd |
| Kay | 2020-11-14 09:08:00 | 7 | <= interpolaetd |
| Kay | 2020-11-14 09:09:00 | 6 | |
| Kay | 2020-11-14 09:10:00 | 7 | |
+------+---------------------+-------+-----------------+
我可以问你一些聪明的解决方案吗?
补充:这是
谢谢。
这与您之前的问题没有太大区别。从接受的答案开始,您可以:
select name, time,
ifnull(value, start_value + (end_value - start_value) / (end_tick - start_tick) * (time - start_tick)) as value_interpolated
from (
select name, time, value,
first_value(tick ignore nulls ) over win1 as start_tick,
first_value(value ignore nulls) over win1 as start_value,
first_value(tick ignore nulls ) over win2 as end_tick,
first_value(value ignore nulls) over win2 as end_value,
from (
select name, time, t.time as tick, value
from (
select name, generate_array(min(time), max(time)) times
from `project.dataset.table`
group by name
)
cross join unnest(times) time
left join `project.dataset.table` t using(name, time)
)
window
win1 as (partition by name order by time desc rows between current row and unbounded following),
win2 as (partition by name order by time rows between current row and unbounded following)
)
以下是 BigQuery SQL
#standardSQL
select name, time,
ifnull(value, start_value
+ (end_value - start_value) / timestamp_diff(end_tick, start_tick, minute) * timestamp_diff(time, start_tick, minute)
) as value_interpolated
from (
select name, time, value,
first_value(tick ignore nulls ) over win1 as start_tick,
first_value(value ignore nulls) over win1 as start_value,
first_value(tick ignore nulls ) over win2 as end_tick,
first_value(value ignore nulls) over win2 as end_value,
from (
select name, time, t.time as tick, value
from (
select name, generate_timestamp_array(min(time), max(time), interval 1 minute) times
from `project.dataset.table`
group by name
)
cross join unnest(times) time
left join `project.dataset.table` t
using(name, time)
)
window
win1 as (partition by name order by time desc rows between current row and unbounded following),
win2 as (partition by name order by time rows between current row and unbounded following)
)
如果应用于您问题中的示例数据 - 输出为