Subset to remove duplicate id and condition
Suppose this is my dataset:
Id Weight Category
1 10.2 Pre
1 12.1 Post
2 11.3 Post
3 12.9 Pre
4 10.3 Post
4 12.3 Pre
5 11.8 Pre
How can I remove the rows whose Id is duplicated and whose Category is Pre? My expected final dataset is:
Id Weight Category
1 12.1 Post
2 11.3 Post
3 12.9 Pre
4 10.3 Post
5 11.8 Pre
You can arrange the data and then use distinct.
library(dplyr)
df %>% arrange(Id, Category) %>% distinct(Id, .keep_all = TRUE)
# Id Weight Category
#1 1 12.1 Post
#2 2 11.3 Post
#3 3 12.9 Pre
#4 4 10.3 Post
#5 5 11.8 Pre
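This works because 'Pre' > 'Post' in string comparison, so arrange puts the Post row first within each Id and distinct keeps that first row.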
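The comparison this answer relies on can be checked directly. The following is a minimal base R sketch, not part of the original answer. The name by_level is only illustrative, and since string comparison in R follows the locale's collation, the sketch assumes a locale that orders "Post" before "Pre".

```r
# "Post" sorts before "Pre" because "o" precedes "r" at the second character
"Pre" > "Post"
sort(c("Pre", "Post"))

# If the preferred label did not happen to sort first alphabetically,
# a factor with explicit levels could control the order instead.
# by_level is an illustrative name; "Pre" is deliberately listed first here.
by_level <- factor(c("Post", "Pre"), levels = c("Pre", "Post"))
sort(by_level)  # sorted by level order, not alphabetically
```

Converting Category to a factor whose levels list the preferred value first should let the same arrange and distinct pipeline work without depending on alphabetical order.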
Using by, split dat by Id, select the Post row from each group that has two rows, then rbind the results.
do.call(rbind, by(dat, dat$Id, function(x)
if (nrow(x) == 2) x[x$Category == 'Post', ] else x))
# Id Weight Category
# 1 1 12.1 Post
# 2 2 11.3 Post
# 3 3 12.9 Pre
# 4 4 10.3 Post
# 5 5 11.8 Pre
Data:
dat <- read.table(header=T, text='
Id Weight Category
1 10.2 Pre
1 12.1 Post
2 11.3 Post
3 12.9 Pre
4 10.3 Post
4 12.3 Pre
5 11.8 Pre
')
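The by approach above checks for groups of exactly two rows. As a hedged variation, the sketch below tests for the presence of a Post row instead, so it does not depend on the group size. The helper name keep_post_if_present is hypothetical, and the rule "keep every row when an Id has no Post" is an assumption, not something stated in the question.

```r
dat <- read.table(header = TRUE, text = '
Id Weight Category
1 10.2 Pre
1 12.1 Post
2 11.3 Post
3 12.9 Pre
4 10.3 Post
4 12.3 Pre
5 11.8 Pre
')

# Hypothetical helper: keep only the Post rows when the group has any,
# otherwise return the group unchanged (assumed rule).
keep_post_if_present <- function(x) {
  has_post <- x$Category == "Post"
  if (any(has_post)) x[has_post, , drop = FALSE] else x
}

# Split by Id, apply the helper to each piece, and stack the pieces again
res <- do.call(rbind, lapply(split(dat, dat$Id), keep_post_if_present))
rownames(res) <- NULL
res
```

On this dataset the result has one row per Id, matching the expected output in the question.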
We can use filter with first() after grouping and arranging, since Post sorts before Pre:
df %>%
  group_by(Id) %>%
  arrange(Id, Category) %>%
  filter(Category == first(Category))
Output:
Id Weight Category
<int> <dbl> <chr>
1 1 12.1 Post
2 2 11.3 Post
3 3 12.9 Pre
4 4 10.3 Post
5 5 11.8 Pre
Using subset from base R:
subset(df[with(df, order(Id, Category == 'Pre')),], !duplicated(Id))
Id Weight Category
2 1 12.1 Post
3 2 11.3 Post
4 3 12.9 Pre
5 4 10.3 Post
7 5 11.8 Pre
Data:
df <- structure(list(Id = c(1L, 1L, 2L, 3L, 4L, 4L, 5L), Weight = c(10.2,
12.1, 11.3, 12.9, 10.3, 12.3, 11.8), Category = c("Pre", "Post",
"Post", "Pre", "Post", "Pre", "Pre")), class = "data.frame",
row.names = c(NA,
-7L))
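The subset answer sorts on the logical expression Category == 'Pre'. The sketch below, which is an illustration and not part of the original answer, spells out the intermediate steps; the names is_pre, ord, sorted and result are introduced here only for readability.

```r
df <- data.frame(
  Id = c(1L, 1L, 2L, 3L, 4L, 4L, 5L),
  Weight = c(10.2, 12.1, 11.3, 12.9, 10.3, 12.3, 11.8),
  Category = c("Pre", "Post", "Post", "Pre", "Post", "Pre", "Pre")
)

# FALSE sorts before TRUE, so rows that are NOT "Pre" come first within an Id
is_pre <- df$Category == "Pre"
ord <- order(df$Id, is_pre)
ord

# After sorting, duplicated() flags every row after the first one per Id,
# so negating it keeps exactly the first (non-Pre if available) row.
sorted <- df[ord, ]
result <- sorted[!duplicated(sorted$Id), ]
result
```

Because the key is a logical test rather than the string itself, this ordering does not depend on how the labels compare alphabetically.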