r - Create a sequence number for each row within a category (defined by 2+ fields) in a dataframe
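I want to generate an ID number within each group/subset of a data frame, where each group is defined by two or more fields. In this test dataset, I want to use "personid" and "date" as my categories: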
personid date measurement
1 x 23
1 x 32
2 y 21
3 x 23
3 z 23
3 y 23
I would like to add an id column for each unique combination of the two columns "personid" and "date", always starting at 1. This is my desired output:
personid date measurement id
1 x 23 1
1 x 32 1
2 y 21 1
3 x 23 1
3 z 23 2
3 y 23 3
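This is similar to a question from 3 years ago, "Create a sequential number (counter) for rows within each group of a dataframe", but after many attempts I have not been able to extend its logic to my category defined by 2+ fields. Thanks!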
Same idea as @Procrastinatus Maximus's rleid; here is a dplyr version of it:
library(dplyr)
df %>%
arrange(personid, date) %>%
group_by(personid) %>%
mutate(id = cumsum(date != lag(date, default = first(date))) + 1)
# +1 converts the zero based id to one based id here
# Source: local data frame [6 x 4]
# Groups: personid [3]
#
# personid date measurement id
# <int> <fctr> <int> <dbl>
# 1 1 x 23 1
# 2 1 x 32 1
# 3 2 y 21 1
# 4 3 x 23 1
# 5 3 y 23 2
# 6 3 z 23 3
For rleid or cumsum to work here, the data frame must first be sorted by personid and then by date, because both approaches only look at adjacent values.
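To illustrate why the sort matters, here is a minimal base R sketch of the same adjacent-comparison idea. The helper name run_id is hypothetical (it is not part of dplyr or data.table); it only mimics what cumsum(date != lag(date)) + 1 computes on a single vector.

```r
# Hypothetical helper: counts value changes between adjacent elements,
# the same idea as cumsum(date != lag(date, default = first(date))) + 1.
run_id <- function(x) {
  if (length(x) == 0) return(integer(0))
  cumsum(c(TRUE, x[-1] != x[-length(x)]))
}

dates <- c("x", "y", "x")   # "x" occurs in two separate runs

unsorted_ids <- run_id(dates)        # the second "x" starts a new run
sorted_ids   <- run_id(sort(dates))  # after sorting, both "x" rows are adjacent

print(unsorted_ids)
print(sorted_ids)
```

Without sorting, a value that reappears later is treated as a new run, so it gets a fresh id instead of the one it had before.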
Here is one approach:
df <- data.frame(personid = c(1,1,2,3,3,3),
                 date = c("x","x","y","x","z","y"),
                 measurement = c(23,32,21,23,23,23))
#This should create a unique character string for each personid-date pair:
idChar <- paste(df$personid, df$date, sep = ".")
#unique() preserves the order of the first appearance of each pair,
#and match() tells the index of each pair in unique(idChar) for each idChar:
df$id <- match(idChar, unique(idChar))
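Note that match() on the pasted key numbers the pairs across the whole data frame, so the counter does not restart for each personid. If it should restart, as in the desired output of the question, one possible base R sketch is to apply the same match()/unique() idea within each person via ave(). The column name global_id is introduced here only for illustration.

```r
# Sample data from the question; stringsAsFactors = FALSE keeps date as character
df <- data.frame(personid = c(1, 1, 2, 3, 3, 3),
                 date = c("x", "x", "y", "x", "z", "y"),
                 measurement = c(23, 32, 21, 23, 23, 23),
                 stringsAsFactors = FALSE)

# Global numbering of personid-date pairs (the approach shown above)
pair_key <- paste(df$personid, df$date, sep = ".")
df$global_id <- match(pair_key, unique(pair_key))

# Per-person numbering: ave() passes the row indices of each personid to FUN,
# so match()/unique() only see that person's dates
df$id <- ave(seq_along(df$date), df$personid,
             FUN = function(i) match(df$date[i], unique(df$date[i])))

print(df)
```

Because unique() keeps values in order of first appearance, this variant does not require the data frame to be sorted beforehand.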
Two possibilities with the data.table package:
library(data.table)
# option 1
setDT(df)[, id := frank(date, ties.method = 'dense'), by = personid][]
# option 2
setDT(df)[, id := rleid(date), by = personid]
which gives (output shown for option 1):
personid date measurement id
1: 1 x 23 1
2: 1 x 32 1
3: 2 y 21 1
4: 3 x 23 1
5: 3 z 23 3
6: 3 y 23 2
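The two options can number the dates differently when the rows are not sorted: a dense rank follows the sorted order of the values, while a run-based id follows the order in which the rows appear. The sketch below uses base R stand-ins rather than the data.table calls themselves, applied to the dates of personid 3.

```r
dates <- c("x", "z", "y")  # dates of personid 3, in their original row order

# Dense rank by sorted value (stand-in for frank(date, ties.method = 'dense'))
rank_ids <- match(dates, sort(unique(dates)))

# Order of first appearance (agrees with rleid() here only because
# no date reappears after a different one)
appearance_ids <- match(dates, unique(dates))

print(rank_ids)
print(appearance_ids)
```

Which numbering is wanted depends on whether the id should reflect the ordering of the dates or the order of the rows.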