dplyr + group_by and avoid alphabetical sorting
I have the following data:
data <- structure(list(user = c(1234L, 1234L, 1234L, 1234L, 1234L, 1234L,
1234L, 1234L, 1234L, 1234L, 1234L, 4758L, 4758L, 9584L, 9584L,
9584L, 9584L, 9584L, 9584L), time = c(1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L, 10L, 11L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L), fruit = structure(c(1L,
6L, 1L, 1L, 6L, 5L, 5L, 3L, 4L, 1L, 2L, 4L, 2L, 1L, 6L, 5L, 5L,
3L, 2L), .Label = c("apple", "banana", "lemon", "lime", "orange",
"pear"), class = "factor"), count = c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), cum_sum = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 1L, 2L, 1L, 2L, 3L,
4L, 5L, 6L)), .Names = c("user", "time", "fruit", "count", "cum_sum"
), row.names = c(NA, -19L), class = "data.frame")
For each user in this set, I want to see the sequence of fruits over time. However, some fruits are listed "back to back" in time.
user time fruit count cum_sum
1 1234 1 apple 1 1
2 1234 2 pear 1 2
3 1234 3 apple 1 3
4 1234 4 apple 1 4
5 1234 5 pear 1 5
6 1234 6 orange 1 6
7 1234 7 orange 1 7
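What I'm looking for is more of a per-user time series of unique fruits.
The problem is that if I group by user and fruit and then summarise, dplyr automatically sorts the fruit alphabetically: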
data %>%
group_by(user, fruit) %>%
summarise(temp_var=1) %>%
mutate(cum_sum = cumsum(temp_var))
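What I really want is, for user 1234 above (for example), the fruits listed in time order, but with any back-to-back repeats removed. So where we see apple > pear > apple > apple > pear > orange > orange, we would only see apple > pear > apple > pear > orange.
Based on your example, this might help: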
data %>%
  group_by(user) %>%
  # keep the first row per user, plus every row whose fruit differs from the previous one
  filter(c(TRUE, fruit[-1L] != fruit[-length(fruit)])) %>%
  mutate(cum_sum = cumsum(count),
         time = seq_along(count))
# Source: local data frame [16 x 5]
# Groups: user
#
# user time fruit count cum_sum
# 1 1234 1 apple 1 1
# 2 1234 2 pear 1 2
# 3 1234 3 apple 1 3
# 4 1234 4 pear 1 4
# 5 1234 5 orange 1 5
# 6 1234 6 lemon 1 6
# 7 1234 7 lime 1 7
# 8 1234 8 apple 1 8
# 9 1234 9 banana 1 9
# 10 4758 1 lime 1 1
# 11 4758 2 banana 1 2
# 12 9584 1 apple 1 1
# 13 9584 2 pear 1 2
# 14 9584 3 orange 1 3
# 15 9584 4 lemon 1 4
# 16 9584 5 banana 1 5
Using the rleid function from the latest data.table version on CRAN, we can do this simply (though I'm not sure about the exact output you want):
library(data.table) ## v >= 1.9.6
res <- setDT(data)[, .(fruit = fruit[1L]), by = .(user, indx = rleid(fruit))
][, cum_sum := seq_len(.N), by = user
][, indx := NULL]
res
# user fruit cum_sum
# 1: 1234 apple 1
# 2: 1234 pear 2
# 3: 1234 apple 3
# 4: 1234 pear 4
# 5: 1234 orange 5
# 6: 1234 lemon 6
# 7: 1234 lime 7
# 8: 1234 apple 8
# 9: 1234 banana 9
# 10: 4758 lime 1
# 11: 4758 banana 2
# 12: 9584 apple 1
# 13: 9584 pear 2
# 14: 9584 orange 3
# 15: 9584 lemon 4
# 16: 9584 banana 5
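To make it clearer what rleid is doing in the grouping above, here is a rough base R sketch of the same idea. `run_id` is a made-up helper name, not a data.table function, and it assumes the input has no missing values; it only illustrates the run-numbering step, not the per-user grouping.

```r
# Hypothetical base R stand-in for the run-numbering idea behind rleid().
# Assumes x contains no NA values.
run_id <- function(x) {
  x <- as.character(x)
  if (length(x) == 0L) return(integer(0))
  # TRUE wherever the value differs from the previous element
  changed <- c(TRUE, x[-1L] != x[-length(x)])
  # each change starts a new run, so the running total is the run number
  cumsum(changed)
}

# Fruit sequence for user 1234, first seven rows of the question's data
fruit <- c("apple", "pear", "apple", "apple", "pear", "orange", "orange")
ids <- run_id(fruit)
ids
# 1 2 3 3 4 5 5

# keep only the first element of each run
collapsed <- fruit[!duplicated(ids)]
collapsed
# "apple" "pear" "apple" "pear" "orange"
```

Because consecutive repeats share a run number while a later reappearance of the same fruit gets a new one, grouping on that number keeps the time order instead of sorting alphabetically.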
You can use group_indices for this:
data %>%
filter(group_indices_(., .dots = c("user", "fruit")) !=
lag(group_indices_(., .dots = c("user", "fruit")), default = 0)) %>%
group_by(user) %>%
mutate(cum_sum = row_number())
In a similar way to rleid, it generates a unique ID for each group. You then use lag() to filter out every row that has the same ID as the previous one.
#Source: local data frame [16 x 3]
#Groups: user
#
# user fruit cum_sum
#1 1234 apple 1
#2 1234 pear 2
#3 1234 apple 3
#4 1234 pear 4
#5 1234 orange 5
#6 1234 lemon 6
#7 1234 lime 7
#8 1234 apple 8
#9 1234 banana 9
#10 4758 lime 1
#11 4758 banana 2
#12 9584 apple 1
#13 9584 pear 2
#14 9584 orange 3
#15 9584 lemon 4
#16 9584 banana 5
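The underscore verbs such as group_indices_() have since been deprecated in dplyr, so here is a sketch of the same "compare with the previous row" idea using only group_by(), lag() and row_number(). The `toy` data frame is made up for illustration and is not the question's data; it assumes fruit is stored as character and has no missing values.

```r
library(dplyr)

# Small made-up sample in the same shape as the question's data
toy <- data.frame(
  user  = c(1234L, 1234L, 1234L, 1234L, 1234L, 4758L, 4758L),
  time  = c(1L, 2L, 3L, 4L, 5L, 1L, 2L),
  fruit = c("apple", "pear", "apple", "apple", "pear", "pear", "pear"),
  stringsAsFactors = FALSE
)

res <- toy %>%
  arrange(user, time) %>%
  group_by(user) %>%
  # keep the first row of each user, and every row whose fruit
  # differs from the previous row within that user
  filter(row_number() == 1L | fruit != lag(fruit)) %>%
  # renumber the surviving rows per user
  mutate(cum_sum = row_number()) %>%
  ungroup()

res$fruit
# "apple" "pear" "apple" "pear" "pear"
```

Since lag() is evaluated within each user, the last fruit of one user is never compared with the first fruit of the next, which is why the first "pear" of user 4758 is kept.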