Determine subgroup index
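I have a large data frame containing groups and subgroups. I want to determine the index of each subgroup within its group, as shown in the OUTPUT column of the following data frame: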
df <- data.frame(
Group = factor(c("A","A","A","A","A","B","B","B","B")),
Subgroup = factor(c("a","a","b","b","b","a","a","b","b")),
OUTPUT = c(1,1,2,2,2,1,1,2,2)
)
I have tried several possibilities without success. I would like to work with dplyr, but I am not sure how to do it. The code below returns an unexpected result.
require(dplyr)
df <- df %>%
group_by(Group) %>%
mutate(
OUTPUT_2 = dplyr::id(Subgroup)
)
#df
# Group Subgroup OUTPUT_2
# (fctr) (fctr) (int)
#1 A a 8
#2 A a 8
#3 A b 8
#4 A b 8
#5 A b 8
#6 B a 4
#7 B a 4
#8 B b 4
#9 B b 4
I feel I am close, but not quite there yet. Can anyone help?
library(data.table)
dt = as.data.table(df) # or setDT to convert in place
unique(dt[, .(Group, Subgroup)])[, idx := 1:.N, by = Group][dt, on = c('Group', 'Subgroup')]
# Group Subgroup idx OUTPUT
#1: A a 1 1
#2: A a 1 1
#3: A b 2 2
#4: A b 2 2
#5: A b 2 2
#6: B a 1 1
#7: B a 1 1
#8: B b 2 2
#9: B b 2 2
Translating this to dplyr should be straightforward.
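For illustration, one possible dplyr translation of the lookup-and-join idea is sketched below. The names `idx_lookup` and `result` are made up for this example, and the use of `distinct()`, `row_number()` and `left_join()` is an assumption about how one might translate it, not the answerer's own code.

```r
library(dplyr)

# Example data from the question, without the expected OUTPUT column
df <- data.frame(
  Group    = factor(c("A","A","A","A","A","B","B","B","B")),
  Subgroup = factor(c("a","a","b","b","b","a","a","b","b"))
)

# One row per Group/Subgroup pair, numbered within each Group
# (idx_lookup is a hypothetical name used only in this sketch)
idx_lookup <- df %>%
  distinct(Group, Subgroup) %>%
  group_by(Group) %>%
  mutate(idx = row_number()) %>%
  ungroup()

# Join the lookup back onto the original rows; left_join keeps the row order of df
result <- left_join(df, idx_lookup, by = c("Group", "Subgroup"))
```

Because `distinct()` keeps pairs in order of first appearance, `idx` numbers the subgroups in the order they first occur within each group.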
Another approach, following the idea of using factors from aosmith's comment, is:
dt[, idx := as.integer(factor(Subgroup, unique(Subgroup))), by = Group][]
This creates, for each group, a factor with the correct levels, whose integer codes are the index you are after.
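To see why `unique(Subgroup)` is passed as the levels, it may help to compare the integer codes on a plain vector. The vector `x` below is made-up data, chosen so that first-appearance order differs from sorted order.

```r
# Made-up vector in which "b" appears before "a"
x <- c("b", "b", "a", "c", "a")

# Levels in order of first appearance: "b", "a", "c"
f_first <- factor(x, levels = unique(x))
as.integer(f_first)   # 1 1 2 3 2

# Default levels are sorted: "a", "b", "c"
f_sorted <- factor(x)
as.integer(f_sorted)  # 2 2 1 3 1
```

With the default sorted levels the codes follow alphabetical order, so the explicit `levels` argument is what makes the index follow the order of first appearance.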
Here is a data.table solution without aggregation:
dt[order(Subgroup), Output := cumsum(!duplicated(Subgroup)) , by = .(Group)]
This should be much faster than the aggregation-based approach.
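The `cumsum(!duplicated(...))` idiom can be examined on plain vectors. The small sketch below uses made-up vectors `x` and `y` to show what it computes and why the rows are ordered by Subgroup first.

```r
# Values already grouped together: each first occurrence bumps the counter
x <- c("a", "a", "b", "b", "b")
first_seen <- !duplicated(x)     # TRUE FALSE TRUE FALSE FALSE
idx_x <- cumsum(first_seen)      # 1 1 2 2 2

# Values not grouped together: the second "a" inherits the running count
y <- c("a", "b", "a")
idx_y_unsorted <- cumsum(!duplicated(y))        # 1 2 2, not the wanted 1 2 1
idx_y_sorted   <- cumsum(!duplicated(sort(y)))  # 1 1 2, correct for sorted y
```

Note that after ordering, the index follows the sorted order of the subgroup values rather than their order of first appearance. The two coincide in the question's example data.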
We can use the factor route with dplyr:
library(dplyr)
df %>%
group_by(Group) %>%
mutate(OUTPUT = as.numeric(factor(Subgroup, levels= unique(Subgroup))))
# Group Subgroup OUTPUT
# <fctr> <fctr> <dbl>
#1 A a 1
#2 A a 1
#3 A b 2
#4 A b 2
#5 A b 2
#6 B a 1
#7 B a 1
#8 B b 2
#9 B b 2
Another option is to match 'Subgroup' against its unique elements after grouping by 'Group':
df %>%
group_by(Group) %>%
mutate(OUTPUT = match(Subgroup, unique(Subgroup)) )
# Group Subgroup OUTPUT
# <fctr> <fctr> <int>
#1 A a 1
#2 A a 1
#3 A b 2
#4 A b 2
#5 A b 2
#6 B a 1
#7 B a 1
#8 B b 2
#9 B b 2
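If dplyr is not available, the same `match()` idea can be sketched in base R with `ave()`. This is only an illustrative variant, and the helper name `first_seen_index` is made up for this example.

```r
# Example data from the question, without the expected OUTPUT column
df <- data.frame(
  Group    = factor(c("A","A","A","A","A","B","B","B","B")),
  Subgroup = factor(c("a","a","b","b","b","a","a","b","b"))
)

# Hypothetical helper: position of each value among the unique values,
# in order of first appearance
first_seen_index <- function(x) match(x, unique(x))

# ave() applies the helper separately within each Group and
# returns the results in the original row order
df$OUTPUT <- ave(as.integer(df$Subgroup), df$Group, FUN = first_seen_index)
```

The factor is converted with `as.integer()` first so that `ave()` works on a plain integer vector instead of trying to assign into a factor.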