Subset with condition in data.table
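Suppose we have data like this:
tmp <- data.table(id1 = c(1,1,1,1,2,2,2,3,3), time=c(1,2,3,4,1,2,3,1,2), user_id=c(1,1,1,1,2,2,2,1,1) )
For each user_id, I want all the rows, except that among the rows where id1 == max(id1) I want to drop those with time > 2.
I am currently using the following code, which gives me these warning messages: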
tmp1 <- tmp[, if (id1 == max(id1)) .SD[time <= 2,] else .SD , by="user_id"]
Warning messages:
1: In if (id1 == max(id1)) .SD[time <= 2, ] else .SD :
the condition has length > 1 and only the first element will be used
2: In if (id1 == max(id1)) .SD[time <= 2, ] else .SD :
the condition has length > 1 and only the first element will be used
I guess this is a vectorization problem with the if/else statement, so I changed the code to the following:
tmp2 <- tmp[, ifelse(id1 == max(id1), .SD[time <= 2,] , .SD) , by="user_id"]
Error in `[.data.table`(tmp, , ifelse(id1 == max(id1), .SD[time <= 2, :
Supplied 4 items for column 5 of group 1 which has 6 rows. The RHS length must either be 1 (single values are ok) or match the LHS length exactly. If you wish to 'recycle' the RHS please use rep() explicitly to make this intent clear to readers of your code.
How can I correct my code?
Thanks!
You can do it like this:
library(data.table)
tmp[, .SD[!(id1 == max(id1) & time > 2)], user_id]
# user_id id1 time
#1: 1 1 1
#2: 1 1 2
#3: 1 1 3
#4: 1 1 4
#5: 1 3 1
#6: 1 3 2
#7: 2 2 1
#8: 2 2 2
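Using dplyr (group by user_id first, so that max(id1) is computed per user):
library(dplyr)
tmp %>% group_by(user_id) %>% filter(!(id1 == max(id1) & time > 2))
Or in base R, using ave() to get the per-user maximum of id1:
tmp[!(tmp$id1 == ave(tmp$id1, tmp$user_id, FUN = max) & tmp$time > 2), ]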
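As for why the two attempts in the question fail: if () expects a single TRUE/FALSE, but id1 == max(id1) is a vector with one value per row of the group, so only its first element is used (and in R 4.2.0 and later this is an error rather than a warning). ifelse() is vectorised over elements, so it is not a way to choose between two tables either. What is needed is a logical mask over the rows of each group. Here is a minimal sketch of the same filter written with .I (row numbers) instead of .SD; the names keep and res are only for illustration:

```r
library(data.table)

tmp <- data.table(
  id1     = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
  time    = c(1, 2, 3, 4, 1, 2, 3, 1, 2),
  user_id = c(1, 1, 1, 1, 2, 2, 2, 1, 1)
)

# Within each user_id, .I holds the row numbers of tmp that belong to the group.
# Keep the row numbers of rows that are NOT (at the group's max id1 AND time > 2).
keep <- tmp[, .I[!(id1 == max(id1) & time > 2)], by = user_id]$V1

# Subset the original table once, using the collected row numbers.
res <- tmp[keep]
```

This should return the same 8 rows as the .SD version, with the columns in their original order. Collecting row numbers and subsetting once is a common alternative to subsetting .SD inside every group, and it may be faster when there are many groups.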