r - Create a sequence number for each row within a category (defined by 2+ fields) in a dataframe

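I want to generate an ID number within each group/subset of a data frame, where each group is defined by two or more fields. In this test dataset, I want to use "personid" and "date" as my categories: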

personid date measurement 
1         x     23
1         x     32
2         y     21
3         x     23
3         z     23
3         y     23

I would like to add an id column for each unique combination of the two columns "personid" and "date", always starting from 1. This is my desired output:

personid date measurement id
1         x     23         1
1         x     32         1
2         y     21         1
3         x     23         1
3         z     23         2
3         y     23         3

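This is similar to a question from 3 years ago, "Create a sequential number (counter) for rows within each group of a dataframe", but after many attempts I have not been able to extend its logic to my category definition of 2+ fields. Thanks!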

Same idea as @Procrastinatus Maximus's rleid; here is a dplyr version of it:

library(dplyr)
df %>% 
      arrange(personid, date) %>% 
      group_by(personid) %>% 
      mutate(id = cumsum(date != lag(date, default = first(date))) + 1)
      # +1 converts the zero based id to one based id here

# Source: local data frame [6 x 4]
# Groups: personid [3]
#
#   personid   date measurement    id
#      <int> <fctr>       <int> <dbl>
# 1        1      x          23     1
# 2        1      x          32     1
# 3        2      y          21     1
# 4        3      x          23     1
# 5        3      y          23     2
# 6        3      z          23     3

For rleid/cumsum to work here, we first have to sort the data frame by personid and then by date, because both methods only look at adjacent values.
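To illustrate why the sort matters, here is a minimal base R sketch. The helper `run_id` and the vector `dates` are hypothetical names made up for this example; `run_id` is only a stand-in for the "new id whenever the value changes" idea, not data.table's actual rleid implementation.

```r
# Hypothetical helper: start at 1 and add 1 each time a value differs
# from the one immediately before it (adjacent comparison only).
run_id <- function(x) cumsum(c(TRUE, x[-1] != x[-length(x)]))

# Hypothetical dates for a single person, not sorted
dates <- c("x", "y", "x")

unsorted_ids <- run_id(dates)        # 1 2 3: the second "x" gets a new id
sorted_ids   <- run_id(sort(dates))  # 1 1 2: equal dates now share an id
```

Without sorting, the two "x" rows are not adjacent, so they end up with different ids.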

Here is one way:

df <- data.frame(personid = c(1,1,2,3,3,3), 
                 date = c("x","x","y","x","z","y"), 
                 measurement = c(23,32,21,23,23,23))

#This should create a unique character string for each personid-date pair:
idChar <- paste(df$personid, df$date, sep = ".")

#unique() preserves the order of the first appearance of each pair,
#and match() tells the index of each pair in unique(idChar) for each idChar:
df$id <- match(idChar, unique(idChar))
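Note that matching on the pasted key numbers the pairs across the whole data frame (1, 1, 2, 3, 4, 5 for this data). If the id should restart at 1 for each personid, as in the desired output, one possible sketch is to apply the same match()/unique() idea within each person using base R's ave(). The column name `id_within` is a hypothetical choice for this example.

```r
df <- data.frame(personid = c(1, 1, 2, 3, 3, 3),
                 date = c("x", "x", "y", "x", "z", "y"),
                 measurement = c(23, 32, 21, 23, 23, 23))

# ave() splits the row indices by personid; within each person, match()
# numbers the dates in order of first appearance, starting from 1.
df$id_within <- ave(seq_len(nrow(df)), df$personid,
                    FUN = function(i) match(df$date[i], unique(df$date[i])))

df$id_within
# 1 1 1 1 2 3
```

Because this numbers dates by first appearance rather than by adjacency, it does not require sorting the data frame first.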

Two possibilities with the data.table package:

library(data.table)
# option 1
setDT(df)[, id := frank(date, ties.method = 'dense'), by = personid][]
# option 2
setDT(df)[, id := rleid(date), by = personid]

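which gives (the output shown is for option 1; because rleid only compares adjacent values, option 2 on this unsorted data numbers the rows of person 3 as 1, 2, 3 in order of appearance):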

   personid date measurement id
1:        1    x          23  1
2:        1    x          32  1
3:        2    y          21  1
4:        3    x          23  1
5:        3    z          23  3
6:        3    y          23  2