Calculate the number of individuals with more than i days of life with data.table in R

Here is a simplified version of my data.table:

Individual time_alive (day)
ID1 1
ID2 5
ID3 7
ID4 5

I need to count the number of individuals alive on each day. I did this with a loop:

for (i in c(-1:600)) {
  y<-summarise(DT , time_alive > i )
  Alive[i+2,]<-length(y[y==TRUE])
}

However, this takes a really long time, because the data.frame has more than 2B observations.

I wanted to try an alternative with data.table, but I am stuck at computing the survival count for a single day only:

DT[, .N, time_alive > i][time_alive == TRUE, 2]

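Here, i cannot be a vector, only a single number. I want to calculate the number of individuals with more than i days of life without using a loop.

My expected result for the simplified data is:

Day Number of individuals alive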
1 4
2 3
3 3
4 3
5 3
6 1
7 1
8 0

The best solution as a one-liner; with data.table it is much faster than the loop:

DT[, .(Day = seq_len(1 + max(time_alive)))][DT[,.(time_alive)], .(.N), on = .(Day <= time_alive), by = Day]

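I would approach the problem differently.

If you run data.frame(Alive = cumsum(rev(table(c(1,5,7,5))))) (or, in your general case, data.frame(Alive = cumsum(rev(table(DT$time_alive))))), you get the information you need. The only caveat is that if there is any day on which no one died, you will end up with gaps in the data.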
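One possible way to close those gaps is to tabulate deaths over every day from 1 to the maximum, so that days without deaths get an explicit zero. This is only a sketch, assuming time_alive holds whole days; the names days, deaths and alive are illustrative and not part of the original answer.

```r
# Simplified data from the question
time_alive <- c(1, 5, 7, 5)

# Every day from 1 to the longest lifetime, including days with no deaths
days <- seq_len(max(time_alive))

# Deaths per day; factor levels force a zero count for the missing days
deaths <- as.vector(table(factor(time_alive, levels = days)))

# Alive on day d = individuals whose time_alive is >= d
# (reverse cumulative sum of the death counts)
alive <- rev(cumsum(rev(deaths)))

result <- data.frame(Day = days, Alive = alive)
result
```

Because the counts come from a single table() call and a cumulative sum, the work grows with the number of rows plus the number of days, rather than with their product.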

# @r2evans suggestion about making it a one-liner
# replaced res = data.table('day' = 1:max(DT$time_alive))
DT[, .(day = seq_len(1 + max(time_alive)))][
     # my original solution
     DT, .(.N), on = .(day <= time_alive), by = day, allow.cartesian = TRUE]

# or

DT[,time_alive > TARGET_NUMBER, by = individual]

Based on what you provided, I have two solutions. One or both of them should be what you are looking for. See below for details/explanation.

# load in data
DT = data.table('individual' = 1:4, 'time_alive' = c(1,5,7,5))
# set your target number
TARGET_NUMBER = 5

# group by individual,
# then check if the number of days they were alive is greater than your target
# this answers "I want to calculate the number of
# individuals with more than i days of life"

DT[,time_alive > TARGET_NUMBER, by = individual]

   individual    V1
1:          1 FALSE
2:          2 FALSE
3:          3  TRUE
4:          4 FALSE

# if the result you want is the table you showed, that is a little different:
# create a table with days ranging from 1 to the longest survival time

res = data.table('day' = 1:max(DT$time_alive))

   day
1:   1
2:   2
3:   3
4:   4
5:   5
6:   6
7:   7


# use joins
# join by time alive being greater than or equal to the day
# group by the specific day, and count how many observations we have
# allow.cartesian because the mapping isn't one-to-one

res[DT, .(.N), on = .(day <= time_alive), by = day, allow.cartesian = TRUE]

   day N
1:   1 4
2:   2 3
3:   3 3
4:   4 3
5:   5 3
6:   6 1
7:   7 1

data.table

library(data.table)
DT[, .(Day = seq_len(max(time_alive) + 1))
  ][, Number := rowSums(outer(Day, DT$time_alive, `<=`))]
#      Day Number
#    <int>  <num>
# 1:     1      4
# 2:     2      3
# 3:     3      3
# 4:     4      3
# 5:     5      3
# 6:     6      1
# 7:     7      1
# 8:     8      0

(I am assuming DT has no more than one row per Individual.)
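Note that outer() builds a matrix with one row per Day and one column per row of DT, which may not fit in memory for very large data. As a hedged alternative, the same counts can be sketched with base R's findInterval() on the sorted lifetimes. The names sorted_alive, n and res below are illustrative assumptions, not part of the answer above.

```r
library(data.table)

# Simplified data from the question
DT <- data.table(Individual = c("ID1", "ID2", "ID3", "ID4"),
                 time_alive = c(1L, 5L, 7L, 5L))

# findInterval() requires its second argument to be sorted
sorted_alive <- sort(DT$time_alive)
n <- length(sorted_alive)

res <- data.table(Day = seq_len(max(sorted_alive) + 1L))

# With left.open = TRUE, findInterval() returns how many sorted values are
# strictly below each Day; subtracting from n leaves those with time_alive >= Day
res[, Number := n - findInterval(Day, sorted_alive, left.open = TRUE)]
res
```

This trades the Day-by-individual matrix for one sort and one binary search per day, and it does not depend on time_alive being an integer.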


Data

DT <- setDT(structure(list(Individual = c("ID1", "ID2", "ID3", "ID4"), time_alive = c(1L, 5L, 7L, 5L)), class = c("data.table", "data.frame"), row.names = c(NA, -4L)))