Create count of sequential years by groups in R
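R newbie here. I'm looking for a dplyr solution (preferably) to create a vector that shows the number of sequential years within a group. If the sequence is interrupted by any gap, the counter should restart, even within the same group.
My data looks similar to this: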
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
library(magrittr)
library(tidyverse)
df <- tribble(
~id, ~ref, ~branch, ~year, ~unit, ~client, ~group,
1, 561, "LA", 2000, "x", "y", "z",
2, 561, "LA", 2001, "x", "y", "z",
3, 561, "LA", 2002, "x", "y", "z",
4, 561, "LA", 2003, "x", "y", "z",
5, 561, "LA", 2004, "x", "y", "z",
6, 561, "LA", 2005, "x", "y", "z",
7, 561, "LA", 2007, "x", "y", "z",
8, 561, "LA", 2008, "x", "y", "z",
9, 561, "LA", 2009, "x", "y", "z",
)
My expected output is this, with "seq_count" added:
df_exp <- tribble(
~id, ~ref, ~branch, ~year, ~unit, ~client, ~group, ~seq_count,
1, 561, "LA", 2000, "x", "y", "z", 6,
2, 561, "LA", 2001, "x", "y", "z", 6,
3, 561, "LA", 2002, "x", "y", "z", 6,
4, 561, "LA", 2003, "x", "y", "z", 6,
5, 561, "LA", 2004, "x", "y", "z", 6,
6, 561, "LA", 2005, "x", "y", "z", 6,
7, 561, "LA", 2007, "x", "y", "z", 3,
8, 561, "LA", 2008, "x", "y", "z", 3,
9, 561, "LA", 2009, "x", "y", "z", 3,
)
I have tried using dplyr::add_count, as follows:
df1 <- df %>%
  group_by(ref, branch, unit, client, group) %>%
  add_count()
However, this only adds the count for the groups specified in group_by and does not account for the gap between 2005 and 2007. Is there a tidy way to do this in R?
n() will give you the number of observations in the group.
df1 <- df %>%
  group_by(ref, branch, unit, client, group) %>%
  mutate(seq_count = n())
If you only want a summary, you can use summarise instead of mutate.
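A minimal sketch of that summarise variant, assuming the same data as in the question (the name `run_summary` is only illustrative). Note that, like the mutate version, it counts every row in the group and does not restart at the 2005/2007 gap:

```r
library(dplyr)
library(tibble)

# Same data as in the question
df <- tribble(
  ~id, ~ref, ~branch, ~year, ~unit, ~client, ~group,
  1, 561, "LA", 2000, "x", "y", "z",
  2, 561, "LA", 2001, "x", "y", "z",
  3, 561, "LA", 2002, "x", "y", "z",
  4, 561, "LA", 2003, "x", "y", "z",
  5, 561, "LA", 2004, "x", "y", "z",
  6, 561, "LA", 2005, "x", "y", "z",
  7, 561, "LA", 2007, "x", "y", "z",
  8, 561, "LA", 2008, "x", "y", "z",
  9, 561, "LA", 2009, "x", "y", "z"
)

# One row per group instead of one row per observation
run_summary <- df %>%
  group_by(ref, branch, unit, client, group) %>%
  summarise(seq_count = n(), .groups = "drop")
```

Here all rows share the same grouping values, so the result is a single row covering all nine observations.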
You can create another grouping variable that changes whenever there is a gap between years.
library(dplyr)
df %>%
  add_count(group, grp = cumsum(year - lag(year, default = first(year)) > 1),
            name = 'seq_count')
# A tibble: 9 x 9
# id ref branch year unit client group grp seq_count
# <dbl> <dbl> <chr> <dbl> <chr> <chr> <chr> <int> <int>
#1 1 561 LA 2000 x y z 0 6
#2 2 561 LA 2001 x y z 0 6
#3 3 561 LA 2002 x y z 0 6
#4 4 561 LA 2003 x y z 0 6
#5 5 561 LA 2004 x y z 0 6
#6 6 561 LA 2005 x y z 0 6
#7 7 561 LA 2007 x y z 1 3
#8 8 561 LA 2008 x y z 1 3
#9 9 561 LA 2009 x y z 1 3
Or with n():
df %>%
  group_by(group, grp = cumsum(year - lag(year, default = first(year)) > 1)) %>%
  mutate(seq_count = n())
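The snippets above compute the year difference over the whole data frame and rely on the rows already being sorted by year. If your real data has several groups or arrives unsorted, one possible variant is to sort first and build the run id inside each group. This is only a sketch: `df2`, `run`, and `res` are made-up names, and the data is hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical data: two branches, rows not sorted, one gap in each branch
df2 <- tribble(
  ~ref, ~branch, ~year,
  561, "LA", 2001,
  561, "LA", 2000,
  561, "LA", 2003,
  561, "NY", 2005,
  561, "NY", 2006,
  561, "NY", 2008
)

res <- df2 %>%
  arrange(ref, branch, year) %>%                  # years must be ordered within each group
  group_by(ref, branch) %>%
  mutate(run = cumsum(c(TRUE, diff(year) > 1))) %>%  # new run id after every gap
  group_by(ref, branch, run) %>%
  mutate(seq_count = n()) %>%                     # length of each run
  ungroup()
```

Each branch here splits into a run of two consecutive years followed by a single year after the gap, so `seq_count` restarts within the same branch.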