Select most recent rows for duplicate groups with data.table in R
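I have a dataset with duplicate records, which can be identified by a set of grouping columns. I want to flag everything after the earliest record (by date) as a duplicate; if the dates are the same, the row with the first row.id counts as the earliest.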
library(data.table)
library(lubridate)
groupA <- c("A","B","C","A","B","C","D","E","A")
groupB <- c("y","n","n","y","y","n","y","n","y")
#ymd format
date <- c("2017-04-01","2017-02-01","2017-03-01","2017-01-01","2017-05-01","2017-03-01","2017-07-01","2017-08-01","2017-09-01")
mydata <- data.table(groupA, groupB, date=ymd(date))
check.dups <- mydata[,.("count"=.N),by=.(groupA,groupB)]
#These are the duplicate keys
check.dups <- check.dups[count>1,]
#Create duplicate flag on most recent example for duplicates
keycols <- c("groupA","groupB")
setkeyv(mydata, keycols)
setkeyv(check.dups, keycols)
I am stuck on the logic for selecting the rows that come after the earliest date/first row.id in order to create the duplicate flag.
#Select rows for duplicate flag
mydata[check.dups,][date > min(date),dup.flag := ]
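Any help is greatly appreciated.
Expected output:
The A rows are flagged because of their dates; the C row is flagged because of its row.id (the dates are the same).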
groupA groupB date       dup.flag
A      y      2017-04-01 y
B      n      2017-02-01 NA
C      n      2017-03-01 NA
A      y      2017-01-01 NA
B      y      2017-05-01 NA
C      n      2017-03-01 y
D      y      2017-07-01 NA
E      n      2017-08-01 NA
A      y      2017-09-01 y
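Try the duplicated() function from the data.table package. Note that setkey() sorts the table by groupA, groupB and date, so the row order changes from the original: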
setkey(mydata, groupA, groupB, date)
mydata[, dup := duplicated(mydata, by = c("groupA", "groupB"))]
mydata
#    groupA groupB       date   dup
# 1:      A      y 2017-01-01 FALSE
# 2:      A      y 2017-04-01  TRUE
# 3:      A      y 2017-09-01  TRUE
# 4:      B      n 2017-02-01 FALSE
# 5:      B      y 2017-05-01 FALSE
# 6:      C      n 2017-03-01 FALSE
# 7:      C      n 2017-03-01  TRUE
# 8:      D      y 2017-07-01 FALSE
# 9:      E      n 2017-08-01 FALSE
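If the original row order and the "y"/NA style of dup.flag from the expected output matter, something along these lines might work instead. It is only a sketch, not the answer's method: the row.id column and the keep vector are names introduced here for illustration, and as.Date() stands in for lubridate's ymd() so the snippet depends on data.table alone.

```r
library(data.table)

# Same sample data as in the question; as.Date() replaces ymd() (assumption)
mydata <- data.table(
  groupA = c("A","B","C","A","B","C","D","E","A"),
  groupB = c("y","n","n","y","y","n","y","n","y"),
  date   = as.Date(c("2017-04-01","2017-02-01","2017-03-01","2017-01-01",
                     "2017-05-01","2017-03-01","2017-07-01","2017-08-01",
                     "2017-09-01"))
)

# row.id (hypothetical helper column) records the original position
# and serves as the tie-breaker when dates are equal
mydata[, row.id := .I]

# For each (groupA, groupB) group, find the row.id of the earliest record,
# ordering by date first and row.id second
keep <- mydata[order(date, row.id),
               .(row.id = row.id[1L]),
               by = .(groupA, groupB)]$row.id

# Every row that is not the earliest in its group is flagged "y";
# the earliest rows (including groups with a single row) stay NA
mydata[, dup.flag := ifelse(row.id %in% keep, NA_character_, "y")]

print(mydata)
```

Because nothing is keyed or sorted in place, the rows stay in the order they were entered, which makes the result directly comparable with the expected output in the question.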