Get all rows before and n rows after a specific value within group
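I am looking for simple R/SQL code that does the following. Within a given firm, it finds the first instance of 1 in the VALUE column and then
- extracts all rows before it (which all have 0 in the VALUE column) and
- exactly two rows after it, provided both of those rows have the value 0.
The code does this for every firm.
So this table...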
----------------------
| FIRM | YEAR | VALUE |
----------------------
| A | 2007 | 0 |
----------------------
| A | 2008 | 0 |
----------------------
| A | 2009 | 0 |
----------------------
| A | 2010 | 1 |
----------------------
| A | 2011 | 0 |
----------------------
| A | 2012 | 0 |
----------------------
| B | 2009 | 0 |
----------------------
| B | 2010 | 1 |
----------------------
| B | 2011 | 0 |
----------------------
| C | 2010 | 0 |
----------------------
| C | 2011 | 1 |
----------------------
| C | 2012 | 1 |
----------------------
...should end up looking like this:
----------------------
| FIRM | YEAR | VALUE |
----------------------
| A | 2007 | 0 |
----------------------
| A | 2008 | 0 |
----------------------
| A | 2009 | 0 |
----------------------
| A | 2010 | 1 |
----------------------
| A | 2011 | 0 |
----------------------
| A | 2012 | 0 |
----------------------
Thanks very much for your help.
You can compute the minimum year with a 1 for each firm and then use that information:
with tt as (
      -- first year in which each firm has value = 1
      select firm, min(year) as min_year
      from tab t
      where value = 1
      group by firm
     )
select t.*
from (select t.*,
             lag(value) over (partition by firm order by year) as prev_value,
             lead(value) over (partition by firm order by year) as next_value
      from tab t
     ) t join
     tt
     on tt.firm = t.firm
where t.year <= tt.min_year or
      (t.year = tt.min_year + 1 and
       t.value = 0 and
       t.next_value = 0
      ) or
      (t.year = tt.min_year + 2 and
       t.value = 0 and
       t.prev_value = 0
      );
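The last condition, on the two rows after the first 1, is the tricky part.
This assumes that the years are consecutive with no gaps, which is consistent with the data in your question.
EDIT:
You can do this using only window functions: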
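The no-gaps assumption can be checked before relying on the `min_year + 1` and `min_year + 2` arithmetic. Here is a minimal Python sketch of such a check; the helper name `firms_with_gaps` and the representation of rows as `(firm, year, value)` tuples are assumptions made for illustration, not part of the original answer.

```python
from collections import defaultdict

def firms_with_gaps(rows):
    """Return the sorted list of firms whose years are not consecutive.

    `rows` is assumed to be an iterable of (firm, year, value) tuples.
    """
    years_by_firm = defaultdict(list)
    for firm, year, _value in rows:
        years_by_firm[firm].append(year)

    gappy = []
    for firm, years in years_by_firm.items():
        years.sort()
        # Consecutive years differ by exactly 1 between neighbours.
        if any(b - a != 1 for a, b in zip(years, years[1:])):
            gappy.append(firm)
    return sorted(gappy)

# The data from the question.
rows = [
    ("A", 2007, 0), ("A", 2008, 0), ("A", 2009, 0),
    ("A", 2010, 1), ("A", 2011, 0), ("A", 2012, 0),
    ("B", 2009, 0), ("B", 2010, 1), ("B", 2011, 0),
    ("C", 2010, 0), ("C", 2011, 1), ("C", 2012, 1),
]

print(firms_with_gaps(rows))  # [] : no firm has a gap
```

If the list is not empty, the year arithmetic would skip or mismatch rows for those firms, and a row-position approach is the safer choice.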
select t.*
from (select t.*,
             count(*) over (partition by firm, running_value) as cnt,
             -- order by year so that seqnum = 1 is the row holding the first "1"
             row_number() over (partition by firm, running_value order by year) as seqnum
      from (select t.*,
                   sum(value) over (partition by firm order by year) as running_value
            from tab t
           ) t
     ) t
where running_value = 0 or                      -- rows before the first "1"
      (running_value = 1 and seqnum = 1) or     -- first "1"
      (running_value = 1 and seqnum <= 3 and    -- the two 0 rows after it
       cnt >= 3);
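To see what the running sum does, here is a rough Python sketch of the same logic on the question's data. The function name `select_rows` and the `(firm, year, value)` tuple layout are assumptions for illustration, and VALUE is assumed to contain only 0 and 1. Within a firm, the running sum is 0 before the first 1, and stays at 1 for the first 1 and every 0 that follows it until a second 1 appears.

```python
from itertools import groupby
from operator import itemgetter

def select_rows(rows):
    """Mimic the window-function query on (firm, year, value) tuples."""
    out = []
    ordered = sorted(rows, key=itemgetter(0, 1))  # by firm, then year
    for _firm, group in groupby(ordered, key=itemgetter(0)):
        running = 0
        tagged = []
        for row in group:
            running += row[2]  # running sum of value within the firm
            tagged.append((running, row))
        # running == 0: rows before the first 1
        out.extend(row for r, row in tagged if r == 0)
        # running == 1: the first 1 followed by any zeros after it
        ones = [row for r, row in tagged if r == 1]
        if len(ones) >= 3:
            out.extend(ones[:3])  # first 1 plus the two zeros after it
        else:
            out.extend(ones[:1])  # only the first 1
    return out

rows = [
    ("A", 2007, 0), ("A", 2008, 0), ("A", 2009, 0),
    ("A", 2010, 1), ("A", 2011, 0), ("A", 2012, 0),
    ("B", 2009, 0), ("B", 2010, 1), ("B", 2011, 0),
    ("C", 2010, 0), ("C", 2011, 1), ("C", 2012, 1),
]

for row in select_rows(rows):
    print(row)
```

Like the query, this keeps all six rows of firm A, and for firms B and C it keeps the rows up to and including their first 1. Dropping B and C entirely, as in the expected output of the question, would need an extra filter on firms where the two following rows exist and are 0.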
Using R, you can write a function that returns the row numbers to select.
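get_rows <- function(VALUE) {
  # position of the first 1 in the group (NA if the group has no 1)
  ind <- which(VALUE == 1)[1]
  # keep the group only if two rows follow the first 1 and both are 0
  if (!is.na(ind) && (ind + 2) <= length(VALUE) && all(VALUE[c(ind + 1, ind + 2)] == 0))
    sort(c(which(VALUE[seq_len(ind + 2)] == 0), ind))
  else 0  # slice(0) selects no rows
}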
Then apply it to each FIRM.
library(dplyr)
df %>% group_by(FIRM) %>% slice(get_rows(VALUE))
# FIRM YEAR VALUE
# <fct> <int> <int>
#1 A 2007 0
#2 A 2008 0
#3 A 2009 0
#4 A 2010 1
#5 A 2011 0
#6 A 2012 0
Data
df <- structure(list(FIRM = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"),
YEAR = c(2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2009L,
2010L, 2011L, 2010L, 2011L, 2012L), VALUE = c(0L, 0L, 0L,
1L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 1L)), class = "data.frame",row.names = c(NA, -12L))