Determine subgroup index

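I have a large data frame containing groups and subgroups. I want to determine the index of each subgroup within its group, as shown in the OUTPUT column of the following data frame: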

df <- data.frame(
  Group = factor(c("A","A","A","A","A","B","B","B","B")),
  Subgroup = factor(c("a","a","b","b","b","a","a","b","b")),
  OUTPUT = c(1,1,2,2,2,1,1,2,2)
)

I have tried several approaches, but none of them worked. I would like to do this with dplyr, but I am not sure how. The code below returns unexpected results.

require(dplyr)

df <- df %>%
  group_by(Group) %>%
  mutate(
    OUTPUT_2 = dplyr::id(Subgroup)
  )

#df
#   Group Subgroup OUTPUT_2
#  (fctr)   (fctr)    (int)
#1      A        a        8
#2      A        a        8
#3      A        b        8
#4      A        b        8
#5      A        b        8
#6      B        a        4
#7      B        a        4
#8      B        b        4
#9      B        b        4

I feel like I am close, but not quite there. Can anyone help?

library(data.table)
dt = as.data.table(df) # or setDT to convert in place

unique(dt[, .(Group, Subgroup)])[, idx := 1:.N, by = Group][dt, on = c('Group', 'Subgroup')]
#   Group Subgroup idx OUTPUT
#1:     A        a   1      1
#2:     A        a   1      1
#3:     A        b   2      2
#4:     A        b   2      2
#5:     A        b   2      2
#6:     B        a   1      1
#7:     B        a   1      1
#8:     B        b   2      2
#9:     B        b   2      2

Translating this to dplyr should be straightforward.
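For reference, here is one way that translation might look. This is only a sketch and not necessarily what the answerer had in mind: it assumes the example df from the question, and the names lookup, idx and result are made up for illustration.

```r
library(dplyr)

# Same example data as in the question.
df <- data.frame(
  Group = factor(c("A","A","A","A","A","B","B","B","B")),
  Subgroup = factor(c("a","a","b","b","b","a","a","b","b")),
  OUTPUT = c(1,1,2,2,2,1,1,2,2)
)

# One row per (Group, Subgroup) pair, numbered within each Group.
# 'lookup' and 'idx' are illustrative names.
lookup <- df %>%
  distinct(Group, Subgroup) %>%
  group_by(Group) %>%
  mutate(idx = row_number()) %>%
  ungroup()

# Join the per-pair index back onto the full data.
result <- left_join(df, lookup, by = c("Group", "Subgroup"))
```

As in the data.table version, the index follows the order in which each subgroup first appears within its group.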


Another approach, following the idea of using factors from aosmith's comment, is:

dt[, idx := as.integer(factor(Subgroup, unique(Subgroup))), by = Group][]

For each group, this creates a factor whose levels are in the right order; its integer codes are the index you are after.
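To see why the levels argument matters, here is a small base R sketch on a made-up vector x (not from the question). With levels = unique(x) the integer codes follow order of first appearance; with the default levels they follow sorted order.

```r
# Illustrative vector, not part of the original data.
x <- c("b", "b", "a", "c", "a")

# Levels given in order of first appearance: codes follow that order.
by_appearance <- as.integer(factor(x, levels = unique(x)))
print(by_appearance)  # 1 1 2 3 2

# Default levels are sorted, so codes follow alphabetical order instead.
by_sorted <- as.integer(factor(x))
print(by_sorted)      # 2 2 1 3 1
```

In the question's data the subgroups already appear in alphabetical order within each group, so both variants would give the same result there.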

Here is a data.table solution without aggregation:

dt[order(Subgroup), Output := cumsum(!duplicated(Subgroup)) , by = .(Group)]

This will be much faster than the aggregation-based approach.
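The ordering step is what makes the cumulative sum safe. The base R sketch below uses a made-up vector x in which a value reappears after a different one: without sorting, cumsum(!duplicated(x)) gives the repeat the wrong index, while computing it on the sorted values and writing the result back to the original positions does not. Note that the resulting index then follows sorted order rather than order of first appearance.

```r
# Illustrative vector where "a" reappears after "b".
x <- c("a", "b", "a")

# Without sorting, the running count keeps its last value for the repeat.
naive <- cumsum(!duplicated(x))
print(naive)       # 1 2 2

# Sort first, count new values, then write back to the original positions.
o <- order(x)
sorted_idx <- integer(length(x))
sorted_idx[o] <- cumsum(!duplicated(x[o]))
print(sorted_idx)  # 1 2 1
```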

We can use the factor route with dplyr:

library(dplyr)
df %>% 
    group_by(Group) %>%
    mutate(OUTPUT = as.numeric(factor(Subgroup, levels= unique(Subgroup))))
#   Group Subgroup OUTPUT
#  <fctr>   <fctr>  <dbl>
#1      A        a      1
#2      A        a      1
#3      A        b      2
#4      A        b      2
#5      A        b      2
#6      B        a      1
#7      B        a      1
#8      B        b      2
#9      B        b      2

Another option is match, applied to the unique elements of 'Subgroup' after grouping by 'Group':

df %>%
   group_by(Group) %>% 
   mutate(OUTPUT = match(Subgroup, unique(Subgroup)) )
#   Group Subgroup OUTPUT
#  <fctr>   <fctr>  <int>
#1      A        a      1
#2      A        a      1
#3      A        b      2
#4      A        b      2
#5      A        b      2
#6      B        a      1
#7      B        a      1
#8      B        b      2
#9      B        b      2