如何识别 dplyr() 组中的更改名称?
How to identify changes name in dplyr() groups?
我正在尝试弄清楚如何识别组内的名称更改。
例如,我有一个如下所示的数据框:
df <- data.frame(
state = rep(c("CA", "WI", "NY"), each = 3),
year = rep(c(2000, 2001), each = 9),
name = c("John", "Paul", "Sally",
"Mary", "Fred", "Jane",
"Linda", "Carl", "Jim",
"Peter", "Paul", "Sally",
"Mary", "Kate", "Jane",
"Linda", "Carl", "Jim")
)
> df
state year name
1 CA 2000 John
2 CA 2000 Paul
3 CA 2000 Sally
4 WI 2000 Mary
5 WI 2000 Fred
6 WI 2000 Jane
7 NY 2000 Linda
8 NY 2000 Carl
9 NY 2000 Jim
10 CA 2001 Peter
11 CA 2001 Paul
12 CA 2001 Sally
13 WI 2001 Mary
14 WI 2001 Kate
15 WI 2001 Jane
16 NY 2001 Linda
17 NY 2001 Carl
18 NY 2001 Jim
如您所见,“Peter”在 2001 年取代了“John”,“Kate”在 2001 年取代了“Fred”。
所以我希望输出看起来像:
df <- data.frame(
state = rep(c("CA", "WI", "NY"), each = 3),
year = rep(c(2000, 2001), each = 9),
name = c("John", "Paul", "Sally",
"Mary", "Fred", "Jane",
"Linda", "Carl", "Jim",
"Peter", "Paul", "Sally",
"Mary", "Kate", "Jane",
"Linda", "Carl", "Jim"),
change = c(NA, NA, NA, NA, NA, NA, NA, NA, NA,
1, 0, 0, 0, 1, 0, 0, 0, 0)
)
> df
state year name change
1 CA 2000 John NA
2 CA 2000 Paul NA
3 CA 2000 Sally NA
4 WI 2000 Mary NA
5 WI 2000 Fred NA
6 WI 2000 Jane NA
7 NY 2000 Linda NA
8 NY 2000 Carl NA
9 NY 2000 Jim NA
10 CA 2001 Peter 1
11 CA 2001 Paul 0
12 CA 2001 Sally 0
13 WI 2001 Mary 0
14 WI 2001 Kate 1
15 WI 2001 Jane 0
16 NY 2001 Linda 0
17 NY 2001 Carl 0
18 NY 2001 Jim 0
如您所见,2001 年的 Peter 和 2001 年的 Kate 在“更改”列中均标记为“1”,因为他们分别替换了 2000-CA 和 2000-NY 的“John”和“Fred” .
我一直在考虑使用一些滞后方法,但它似乎只查看前一行,而不是按州、年组:
df2 <- df %>%
group_by(state, year) %>%
mutate(change = lag(name, order_by = year))
如有任何帮助,我们将不胜感激!
根据预期的输出,也许这有帮助——根据整个数据中的duplicated
'name'创建一个逻辑列,然后按'year'、[=15=分组] all
值为 FALSE (!change
),然后替换为 NA
或 else
将逻辑转换为二进制 (+
)
library(dplyr)
df %>%
mutate(rn = row_number()) %>%
arrange(state, year, name) %>%
group_by(state) %>%
mutate(change = !duplicated(name)) %>%
group_by(year, .add = TRUE) %>%
mutate(
change = if(all(change)) NA_integer_ else +(change)) %>%
ungroup %>%
arrange(rn) %>%
select(-rn)
-输出
# A tibble: 18 × 4
state year name change
<chr> <dbl> <chr> <int>
1 CA 2000 John NA
2 CA 2000 Paul NA
3 CA 2000 Sally NA
4 WI 2000 Mary NA
5 WI 2000 Fred NA
6 WI 2000 Jane NA
7 NY 2000 Linda NA
8 NY 2000 Carl NA
9 NY 2000 Jim NA
10 CA 2001 Peter 1
11 CA 2001 Paul 0
12 CA 2001 Sally 0
13 WI 2001 Mary 0
14 WI 2001 Kate 1
15 WI 2001 Jane 0
16 NY 2001 Linda 0
17 NY 2001 Carl 0
18 NY 2001 Jim 0
使用评论中显示的另一个数据
df2 <- structure(list(state = c("AK", "AK", "AK", "AK", "AK", "AK",
"AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK",
"AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK"), year = c(1997L,
1998L, 1995L, 1996L, 1997L, 1995L, 1996L, 1998L, 1997L, 1998L,
1996L, 1995L, 1996L, 1997L, 1998L, 1995L, 1996L, 1997L, 1998L,
1995L, 1996L, 1995L, 1996L, 1997L, 1998L), name = c("A", "A", "A",
"A", "B", "B", "B", "B", "C", "C", "C", "C", "D", "D", "D", "E",
"E", "E", "E", "F", "F", "G", "G", "G", "G")), class = "data.frame",
row.names = c(NA, -25L))
df2 %>%
mutate(rn = row_number()) %>%
arrange(state, year, name) %>%
group_by(state) %>%
mutate(change = !duplicated(name)) %>%
group_by(year, .add = TRUE) %>%
mutate(
change = if(all(change)) NA_integer_ else +(change)) %>%
ungroup %>%
arrange(rn) %>%
select(-rn) %>%
as.data.frame
-输出
state year name change
1 AK 1997 A 0
2 AK 1998 A 0
3 AK 1995 A NA
4 AK 1996 A 0
5 AK 1997 B 0
6 AK 1995 B NA
7 AK 1996 B 0
8 AK 1998 B 0
9 AK 1997 C 0
10 AK 1998 C 0
11 AK 1996 C 0
12 AK 1995 C NA
13 AK 1996 D 1
14 AK 1997 D 0
15 AK 1998 D 0
16 AK 1995 E NA
17 AK 1996 E 0
18 AK 1997 E 0
19 AK 1998 E 0
20 AK 1995 F NA
21 AK 1996 F 0
22 AK 1995 G NA
23 AK 1996 G 0
24 AK 1997 G 0
25 AK 1998 G 0
A base R 方法省略了 NA
s
df2 <- split(df, df$year)
cbind(df, change=rep((!(df2$"2000"$name == df2$"2001"$name))*1, length(df2)))
state year name change
1 CA 2000 John 1
2 CA 2000 Paul 0
3 WI 2000 Sally 0
4 WI 2000 Mary 0
5 NY 2000 Fred 1
6 NY 2000 Jane 0
7 CA 2000 Linda 0
8 CA 2000 Carl 0
9 WI 2000 Jim 0
10 WI 2001 Peter 1
11 NY 2001 Paul 0
12 NY 2001 Sally 0
13 CA 2001 Mary 0
14 CA 2001 Kate 1
15 WI 2001 Jane 0
16 WI 2001 Linda 0
17 NY 2001 Carl 0
18 NY 2001 Jim 0
我正在尝试弄清楚如何识别组内的名称更改。
例如,我有一个如下所示的数据框:
df <- data.frame(
state = rep(c("CA", "WI", "NY"), each = 3),
year = rep(c(2000, 2001), each = 9),
name = c("John", "Paul", "Sally",
"Mary", "Fred", "Jane",
"Linda", "Carl", "Jim",
"Peter", "Paul", "Sally",
"Mary", "Kate", "Jane",
"Linda", "Carl", "Jim")
)
> df
state year name
1 CA 2000 John
2 CA 2000 Paul
3 CA 2000 Sally
4 WI 2000 Mary
5 WI 2000 Fred
6 WI 2000 Jane
7 NY 2000 Linda
8 NY 2000 Carl
9 NY 2000 Jim
10 CA 2001 Peter
11 CA 2001 Paul
12 CA 2001 Sally
13 WI 2001 Mary
14 WI 2001 Kate
15 WI 2001 Jane
16 NY 2001 Linda
17 NY 2001 Carl
18 NY 2001 Jim
如您所见,“Peter”在 2001 年取代了“John”,“Kate”在 2001 年取代了“Fred”。
所以我希望输出看起来像:
df <- data.frame(
state = rep(c("CA", "WI", "NY"), each = 3),
year = rep(c(2000, 2001), each = 9),
name = c("John", "Paul", "Sally",
"Mary", "Fred", "Jane",
"Linda", "Carl", "Jim",
"Peter", "Paul", "Sally",
"Mary", "Kate", "Jane",
"Linda", "Carl", "Jim"),
change = c(NA, NA, NA, NA, NA, NA, NA, NA, NA,
1, 0, 0, 0, 1, 0, 0, 0, 0)
)
> df
state year name change
1 CA 2000 John NA
2 CA 2000 Paul NA
3 CA 2000 Sally NA
4 WI 2000 Mary NA
5 WI 2000 Fred NA
6 WI 2000 Jane NA
7 NY 2000 Linda NA
8 NY 2000 Carl NA
9 NY 2000 Jim NA
10 CA 2001 Peter 1
11 CA 2001 Paul 0
12 CA 2001 Sally 0
13 WI 2001 Mary 0
14 WI 2001 Kate 1
15 WI 2001 Jane 0
16 NY 2001 Linda 0
17 NY 2001 Carl 0
18 NY 2001 Jim 0
如您所见,2001 年的 Peter 和 2001 年的 Kate 在“更改”列中均标记为“1”,因为他们分别替换了 2000-CA 和 2000-NY 的“John”和“Fred” .
我一直在考虑使用一些滞后方法,但它似乎只查看前一行,而不是按州、年组:
df2 <- df %>%
group_by(state, year) %>%
mutate(change = lag(name, order_by = year))
如有任何帮助,我们将不胜感激!
根据预期的输出,也许这有帮助——根据整个数据中的duplicated
'name'创建一个逻辑列,然后按'year'、[=15=分组] all
值为 FALSE (!change
),然后替换为 NA
或 else
将逻辑转换为二进制 (+
)
library(dplyr)
df %>%
mutate(rn = row_number()) %>%
arrange(state, year, name) %>%
group_by(state) %>%
mutate(change = !duplicated(name)) %>%
group_by(year, .add = TRUE) %>%
mutate(
change = if(all(change)) NA_integer_ else +(change)) %>%
ungroup %>%
arrange(rn) %>%
select(-rn)
-输出
# A tibble: 18 × 4
state year name change
<chr> <dbl> <chr> <int>
1 CA 2000 John NA
2 CA 2000 Paul NA
3 CA 2000 Sally NA
4 WI 2000 Mary NA
5 WI 2000 Fred NA
6 WI 2000 Jane NA
7 NY 2000 Linda NA
8 NY 2000 Carl NA
9 NY 2000 Jim NA
10 CA 2001 Peter 1
11 CA 2001 Paul 0
12 CA 2001 Sally 0
13 WI 2001 Mary 0
14 WI 2001 Kate 1
15 WI 2001 Jane 0
16 NY 2001 Linda 0
17 NY 2001 Carl 0
18 NY 2001 Jim 0
使用评论中显示的另一个数据
df2 <- structure(list(state = c("AK", "AK", "AK", "AK", "AK", "AK",
"AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK",
"AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK"), year = c(1997L,
1998L, 1995L, 1996L, 1997L, 1995L, 1996L, 1998L, 1997L, 1998L,
1996L, 1995L, 1996L, 1997L, 1998L, 1995L, 1996L, 1997L, 1998L,
1995L, 1996L, 1995L, 1996L, 1997L, 1998L), name = c("A", "A", "A",
"A", "B", "B", "B", "B", "C", "C", "C", "C", "D", "D", "D", "E",
"E", "E", "E", "F", "F", "G", "G", "G", "G")), class = "data.frame",
row.names = c(NA, -25L))
df2 %>%
mutate(rn = row_number()) %>%
arrange(state, year, name) %>%
group_by(state) %>%
mutate(change = !duplicated(name)) %>%
group_by(year, .add = TRUE) %>%
mutate(
change = if(all(change)) NA_integer_ else +(change)) %>%
ungroup %>%
arrange(rn) %>%
select(-rn) %>%
as.data.frame
-输出
state year name change
1 AK 1997 A 0
2 AK 1998 A 0
3 AK 1995 A NA
4 AK 1996 A 0
5 AK 1997 B 0
6 AK 1995 B NA
7 AK 1996 B 0
8 AK 1998 B 0
9 AK 1997 C 0
10 AK 1998 C 0
11 AK 1996 C 0
12 AK 1995 C NA
13 AK 1996 D 1
14 AK 1997 D 0
15 AK 1998 D 0
16 AK 1995 E NA
17 AK 1996 E 0
18 AK 1997 E 0
19 AK 1998 E 0
20 AK 1995 F NA
21 AK 1996 F 0
22 AK 1995 G NA
23 AK 1996 G 0
24 AK 1997 G 0
25 AK 1998 G 0
A base R 方法省略了 NA
s
df2 <- split(df, df$year)
cbind(df, change=rep((!(df2$"2000"$name == df2$"2001"$name))*1, length(df2)))
state year name change
1 CA 2000 John 1
2 CA 2000 Paul 0
3 WI 2000 Sally 0
4 WI 2000 Mary 0
5 NY 2000 Fred 1
6 NY 2000 Jane 0
7 CA 2000 Linda 0
8 CA 2000 Carl 0
9 WI 2000 Jim 0
10 WI 2001 Peter 1
11 NY 2001 Paul 0
12 NY 2001 Sally 0
13 CA 2001 Mary 0
14 CA 2001 Kate 1
15 WI 2001 Jane 0
16 WI 2001 Linda 0
17 NY 2001 Carl 0
18 NY 2001 Jim 0