Select previous and next N rows with the same value as a certain row
I built the following panel data with keys id and time:
library(dplyr)  # re-exports tibble() and %>%

pdata <- tibble(
  id = rep(1:10, each = 5),
  time = rep(2016:2020, times = 10),
  value = c(c(1,1,1,0,0), c(1,1,0,0,0), c(0,0,1,0,0), c(0,0,0,0,0), c(1,0,0,0,1), c(0,1,1,1,0), c(0,1,1,1,1), c(1,1,1,1,1), c(1,0,1,1,1), c(1,1,0,1,1))
)
pdata
# A tibble: 50 × 3
id time value
<int> <int> <dbl>
1 1 2016 1
2 1 2017 1
3 1 2018 1
4 1 2019 0
5 1 2020 0
6 2 2016 1
7 2 2017 1
8 2 2018 0
9 2 2019 0
10 2 2020 0
# … with 40 more rows
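Let's assume that a shock happens in 2018. For each id, I want to slice the pairs of previous N rows and next N rows that have the same value as the shock row.
Let me illustrate with a few examples. For id == 5, the dataset looks like this: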
pdata %>% filter(id == 5)
# A tibble: 5 × 3
id time value
<int> <int> <dbl>
1 5 2016 1
2 5 2017 0
3 5 2018 0
4 5 2019 0
5 5 2020 1
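In 2018 the value for id == 5 is 0. I want to keep the previous row and the next row (N = 1) together with the shock row itself, because all of these observations share the same value, 0: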
# A tibble: 3 × 3
id time value
<int> <int> <dbl>
1 5 2017 0
2 5 2018 0
3 5 2019 0
For id == 8, I want to get:
# A tibble: 5 × 3
id time value
<int> <int> <dbl>
1 8 2016 1
2 8 2017 1
3 8 2018 1
4 8 2019 1
5 8 2020 1
For id == 1, I want to get an empty dataset, because the observations in 2017 and 2019 do not have the same value.
The final dataset should be:
# A tibble: 19 × 3
id time value
<int> <int> <dbl>
1 4 2016 0
2 4 2017 0
3 4 2018 0
4 4 2019 0
5 4 2020 0
6 5 2017 0
7 5 2018 0
8 5 2019 0
9 6 2017 1
10 6 2018 1
11 6 2019 1
12 7 2017 1
13 7 2018 1
14 7 2019 1
15 8 2016 1
16 8 2017 1
17 8 2018 1
18 8 2019 1
19 8 2020 1
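One way to read the rule described above, sketched in base R: for each id, grow a window around the shock year one step at a time, as long as the rows on both sides exist and carry the shock row's value, and return nothing when the window cannot grow at all. The helper name keep_window and its max_n argument are hypothetical, and the sketch assumes one row per year with no gaps in time.

```r
# Panel data from the question, as a plain data frame
pdata <- data.frame(
  id    = rep(1:10, each = 5),
  time  = rep(2016:2020, times = 10),
  value = c(1,1,1,0,0, 1,1,0,0,0, 0,0,1,0,0, 0,0,0,0,0, 1,0,0,0,1,
            0,1,1,1,0, 0,1,1,1,1, 1,1,1,1,1, 1,0,1,1,1, 1,1,0,1,1)
)

# keep_window (hypothetical helper): rows of ONE id to keep around `shock`.
# Assumes one row per year and no gaps in `time`.
keep_window <- function(d, shock, max_n = Inf) {
  v0 <- d$value[d$time == shock]
  if (length(v0) != 1) return(d[0, ])   # shock year missing for this id
  k <- 0
  repeat {
    # values one step further out on both sides of the shock
    nxt <- d$value[d$time %in% c(shock - (k + 1), shock + (k + 1))]
    # stop when max_n is reached, a side is missing, or a value differs
    if (k >= max_n || length(nxt) != 2 || any(nxt != v0)) break
    k <- k + 1
  }
  if (k == 0) d[0, ] else d[abs(d$time - shock) <= k, ]
}

# Apply per id and stack the pieces back together
res <- do.call(rbind, lapply(split(pdata, pdata$id), keep_window, shock = 2018))
rownames(res) <- NULL
res
```

On the question's data this should reproduce the 19 rows listed above (ids 4 to 8). Passing max_n = 1 would cap the window at one row on each side.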
As far as I understand the question, here is a dplyr suggestion:
library(dplyr)
library(purrr)  # provides map_df()

# Return the rows to keep for a single id
MyF <- function(id2, shock, nb_row) {
  values <- pdata %>%
    filter(id == id2) %>%
    pull(value)
  if (length(unique(values)) == 1) {
    # value never changes for this id: keep every row
    pdata %>%
      filter(id == id2)
  } else {
    # otherwise keep the window around the shock only if it is constant
    pdata %>%
      filter(id == id2) %>%
      filter(time >= shock - nb_row & time <= shock + nb_row) %>%
      filter(length(unique(value)) == 1)
  }
}

map_df(pdata %>%
         select(id) %>%
         distinct() %>%
         pull(),
       MyF,
       shock = 2018, nb_row = 1)
## Or map_df(1:8, MyF, shock = 2018, nb_row = 1)
Output:
# A tibble: 19 x 3
id time value
<int> <int> <dbl>
1 4 2016 0
2 4 2017 0
3 4 2018 0
4 4 2019 0
5 4 2020 0
6 5 2017 0
7 5 2018 0
8 5 2019 0
9 6 2017 1
10 6 2018 1
11 6 2019 1
12 7 2017 1
13 7 2018 1
14 7 2019 1
15 8 2016 1
16 8 2017 1
17 8 2018 1
18 8 2019 1
19 8 2020 1
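Here is another dplyr solution. We group by runs of identical values within each id, and then keep only the rows up to the largest distance from the shock year that occurs twice, i.e. on both sides of the shock.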
pdata %>%
  group_by(id) %>%
  mutate(value_group = cumsum(value != lag(value, default = value[1]))) %>%
  group_by(id, value_group) %>%
  mutate(shock_diff = abs(time - 2018)) %>%
  filter(shock_diff <= max(shock_diff[duplicated(shock_diff)], -Inf))
#> # A tibble: 19 × 5
#> # Groups: id, value_group [5]
#> id time value value_group shock_diff
#> <int> <int> <dbl> <int> <dbl>
#> 1 4 2016 0 0 2
#> 2 4 2017 0 0 1
#> 3 4 2018 0 0 0
#> 4 4 2019 0 0 1
#> 5 4 2020 0 0 2
#> 6 5 2017 0 1 1
#> 7 5 2018 0 1 0
#> 8 5 2019 0 1 1
#> 9 6 2017 1 1 1
#> 10 6 2018 1 1 0
#> 11 6 2019 1 1 1
#> 12 7 2017 1 1 1
#> 13 7 2018 1 1 0
#> 14 7 2019 1 1 1
#> 15 8 2016 1 0 2
#> 16 8 2017 1 0 1
#> 17 8 2018 1 0 0
#> 18 8 2019 1 0 1
#> 19 8 2020 1 0 2
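To see what the two grouped steps compute, here is a base R walk-through for a single id (id == 7, values 0, 1, 1, 1, 1). It mirrors the dplyr expressions rather than replacing them, and it only looks at the run that contains the shock year; the variable names are illustrative.

```r
time  <- 2016:2020
value <- c(0, 1, 1, 1, 1)            # id == 7 in the question's data

# Run id: increases by one each time the value changes
# (base R counterpart of cumsum(value != lag(value, default = value[1])))
value_group <- cumsum(c(FALSE, value[-1] != value[-length(value)]))

# Rows in the same run as the shock year, and their distance to it
in_run     <- value_group == value_group[time == 2018]
shock_diff <- abs(time[in_run] - 2018)

# A distance that appears twice is present on both sides of the shock.
# The extra -Inf gives max() something to return when no distance repeats,
# in which case no row passes the filter.
limit <- max(shock_diff[duplicated(shock_diff)], -Inf)

kept <- time[in_run][shock_diff <= limit]
kept
```

For this id the run covers 2017 to 2020, only distance 1 occurs on both sides, and so 2020 is dropped.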
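Symmetric range around the focal year; the range may differ between 'id's
Within each 'id' (by = id), use rleid to create a grouping variable 'r' based on runs of equal values. Within each 'id' and run (by = .(id, r)), check that at least the year before and the year after the focal year (e.g. 2018) are present (if(sum(time %in% yr_rng) == 3)). If so, select an equal number of rows before and after the focal year (min(c(shock - .I[1], .I[.N] - shock))). Note that the number of years selected here may differ between 'id's.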
library(data.table)
setDT(pdata)
yr = 2018
yr_rng = (yr - 1):(yr + 1)
# run id of equal values within each id
pdata[ , r := rleid(value), by = id]
pdata[pdata[ , if(sum(time %in% yr_rng) == 3) {
  shock = .I[time == yr]                        # row index of the focal year
  rng = min(c(shock - .I[1], .I[.N] - shock))   # rows available on the shorter side
  (shock - rng):(shock + rng)
}, by = .(id, r)]$V1]
id time value r
1: 4 2016 0 1
2: 4 2017 0 1
3: 4 2018 0 1
4: 4 2019 0 1
5: 4 2020 0 1
6: 5 2017 0 2
7: 5 2018 0 2
8: 5 2019 0 2
9: 6 2017 1 2
10: 6 2018 1 2
11: 6 2019 1 2
12: 7 2017 1 2
13: 7 2018 1 2
14: 7 2019 1 2
15: 8 2016 1 1
16: 8 2017 1 1
17: 8 2018 1 1
18: 8 2019 1 1
19: 8 2020 1 1
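The index arithmetic inside the grouped call can be checked by hand for one run. The sketch below uses base R's rle() in place of rleid() and positions within a single id in place of .I, so it illustrates the idea rather than reproducing the data.table code.

```r
time  <- 2016:2020
value <- c(0, 1, 1, 1, 1)                 # id == 7 in the question's data
yr    <- 2018

# Run ids built from rle(): consecutive equal values share an id
runs <- rle(value)
r    <- rep(seq_along(runs$lengths), runs$lengths)

idx   <- which(r == r[time == yr])        # positions of the run holding yr
shock <- which(time == yr)                # position of the focal year

# The shorter side of the run decides the symmetric half-width
rng <- min(shock - idx[1], idx[length(idx)] - shock)
sel <- time[(shock - rng):(shock + rng)]
sel
```

Here the run spans 2017 to 2020, one row is available before the focal year and two after it, so the half-width is 1 and 2020 is left out.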
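Allow an asymmetric range around the focal year
Within each 'id' and run (by = .(id, r)), check that both the year before and the year after the focal year (e.g. 2018) are present (if(sum(time %in% yr_rng) == 3)). If so, select the entire group (.SD).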
pdata[ , r := rleid(value), by = id]
pdata[ , if(sum(time %in% yr_rng) == 3) .SD, by = .(id, r)]
id r time value
1: 4 1 2016 0
2: 4 1 2017 0
3: 4 1 2018 0
4: 4 1 2019 0
5: 4 1 2020 0
6: 5 2 2017 0
7: 5 2 2018 0
8: 5 2 2019 0
9: 6 2 2017 1
10: 6 2 2018 1
11: 6 2 2019 1
12: 7 2 2017 1
13: 7 2 2018 1
14: 7 2 2019 1
15: 7 2 2020 1
16: 8 1 2016 1
17: 8 1 2017 1
18: 8 1 2018 1
19: 8 1 2019 1
20: 8 1 2020 1
A data.table solution:
# load the package & convert data to a data.table
library(data.table)
setDT(pdata)
# define shock-year and number of previous/next rows
shock <- 2018
n <- 2
# filter
pdata[, .SD[value == value[time == shock] &
              between(time, shock - n, shock + n) &
              value == rev(value)][.N > 1 & all(diff(time) == 1)]
      , by = id]
which gives:
id time value
1: 4 2016 0
2: 4 2017 0
3: 4 2018 0
4: 4 2019 0
5: 4 2020 0
6: 5 2017 0
7: 5 2018 0
8: 5 2019 0
9: 6 2017 1
10: 6 2018 1
11: 6 2019 1
12: 7 2017 1
13: 7 2018 1
14: 7 2019 1
15: 8 2016 1
16: 8 2017 1
17: 8 2018 1
18: 8 2019 1
19: 8 2020 1
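The value == rev(value) test works here because the shock year sits in the middle of each id's five rows, so reversing the vector pairs every year with its mirror image around 2018. A small base R check on two ids from the question, assuming that centred layout (mirror_keep is a hypothetical helper name):

```r
time  <- 2016:2020
shock <- 2018

# Years kept for one id under the mirror test
mirror_keep <- function(value) {
  keep <- value == value[time == shock] & value == rev(value)
  # as in the data.table answer: need more than one row and no gaps in time
  if (sum(keep) > 1 && all(diff(time[keep]) == 1)) time[keep] else integer(0)
}

a <- mirror_keep(c(0, 1, 1, 1, 1))   # id == 7
b <- mirror_keep(c(1, 0, 1, 1, 1))   # id == 9
a
b
```

For id 7 the mirror test removes 2020 (its counterpart 2016 differs), leaving 2017 to 2019. For id 9 the matching rows are 2016, 2018 and 2020, which are not consecutive, so nothing is kept. If the shock year were not the midpoint of each id's rows, rev() would no longer line up mirror years and a different test would be needed.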
Data used:
pdata <- data.frame(
  id = rep(1:10, each = 5),
  time = rep(2016:2020, times = 10),
  value = c(c(1,1,1,0,0), c(1,1,0,0,0), c(0,0,1,0,0), c(0,0,0,0,0), c(1,0,0,0,1), c(0,1,1,1,0), c(0,1,1,1,1), c(1,1,1,1,1), c(1,0,1,1,1), c(1,1,0,1,1))
)
One way to solve the problem using data.table:
library(data.table)
yrs = 2017:2019
# keep all rows if the id is constant; else keep 2017-2019 if those are constant
setDT(pdata)[, if(uniqueN(value) == 1) .(time, value)
               else if(uniqueN(value <- value[time %in% yrs]) == 1) .(time = yrs, value),
             by = id]
# id time value
# 1: 4 2016 0
# 2: 4 2017 0
# 3: 4 2018 0
# 4: 4 2019 0
# 5: 4 2020 0
# 6: 5 2017 0
# 7: 5 2018 0
# 8: 5 2019 0
# 9: 6 2017 1
# 10: 6 2018 1
# 11: 6 2019 1
# 12: 7 2017 1
# 13: 7 2018 1
# 14: 7 2019 1
# 15: 8 2016 1
# 16: 8 2017 1
# 17: 8 2018 1
# 18: 8 2019 1
# 19: 8 2020 1