Add column that sums all previous rows that meet condition
Suppose I have a large table with the following columns:
subject stim1 stim2 Chosen
1: 1 2 1 2
2: 1 3 2 2
3: 1 3 1 1
4: 1 2 3 3
5: 1 1 3 1
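I am looking for an efficient way (since the full dataset is large) to mutate additional columns (by subject):
- stim1_seen, stim2_seen = whether the current stim1 previously appeared in either stim1 or stim2 (stim1_seen), and whether the current stim2 previously appeared in either stim1 or stim2 (stim2_seen).
- stim1_chosen, stim2_chosen = the sum of all previous instances in which the current stim1 and the current stim2, respectively, were chosen.
Desired output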
subject stim1 stim2 Chosen stim1_chosen stim2_chosen
1: 1 2 1 2 0 0
2: 1 3 2 2 0 1
3: 1 3 1 1 0 0
4: 1 2 3 3 2 0
5: 1 1 3 1 1 1
6: 1 2 1 1 2 2
Ideally it would use data.table or dplyr.
Here is the dput output:
structure(list(subject = c(1021, 1021, 1021, 1021, 1021, 1021
), stim1 = c(51L, 48L, 49L, 48L, 49L, 46L), stim2 = c(50L, 50L,
47L, 46L, 51L, 47L), Chosen = c(50L, 50L, 49L, 48L, 49L, 46L)), row.names = c(NA,
-6L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x7fc9ce8158e0>)
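OK, this works for the example data. It would be good to run it on data with more subjects, and with cases where the values in the columns are greater than 1. I am assuming the data is a data.table object named dt.
1. Indexing
Merge operations can easily change the row order, so never rely on row numbers; instead, create a rowid by subject. .N is the data.table syntax for the length (number of rows).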
# order matters, so make a rowid
dt[, rowid := 1:.N, by=subject]
# sets orders and indexing to make it quicker
setkey(dt, subject, rowid)
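As a minimal sketch of what this step produces, assuming a small made-up table called toy with two subjects (the name and the values are illustrative only and do not come from the question), the counter restarts at 1 for each subject:

```r
library(data.table)

# hypothetical toy data: two subjects, rows already in trial order
toy <- data.table(subject = c(1, 1, 1, 2, 2),
                  stim1   = c(2L, 3L, 3L, 5L, 4L))

# seq_len(.N) numbers the rows within each subject: 1..N per group
toy[, rowid := seq_len(.N), by = subject]

# keying sorts by (subject, rowid) and speeds up later joins
setkey(toy, subject, rowid)

toy$rowid
# [1] 1 2 3 1 2
```

seq_len(.N) is used here instead of 1:.N only as a defensive habit; for non-empty groups the two give the same result.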
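2. Seen cols
You need to combine stim1 and stim2 into one column. To do this, use melt to go from wide to long format.
Then group by these values to find the cumulative number of occurrences by row. Because we are looking at prior occurrences, we subtract 1, hence seen:=0:(.N-1).
Then we merge twice, because we want to compare against both stim columns.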
# for seen, melt wide to long
dt_seen <- melt(dt,
id.vars = c("subject", "rowid"),
measure.vars = c("stim1", "stim2"))
# interested in finding occurrences
dt_seen <- unique(dt_seen[, .(subject, rowid, value)])
setorder(dt_seen, rowid)
dt_seen[, seen:=0:(.N-1), by=.(subject, value)]
# merge across twice
dt <- merge(dt, dt_seen,
by.x=c("subject", "rowid", "stim1"),
by.y=c("subject", "rowid", "value"),
all.x=TRUE, sort=FALSE)
setnames(dt, "seen", "stim1_seen")
dt <- merge(dt, dt_seen,
by.x=c("subject", "rowid", "stim2"),
by.y=c("subject", "rowid", "value"),
all.x=TRUE, sort=FALSE)
setnames(dt, "seen", "stim2_seen")
dt[]
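To see the melt-then-count idea in isolation, here is a small sketch on a made-up three-row table (the object names wide and long are hypothetical; the values are the first three rows of the dput data). It is only an illustration of the technique, not a replacement for the code above:

```r
library(data.table)

# hypothetical three-row example in wide format
wide <- data.table(subject = 1,
                   rowid   = 1:3,
                   stim1   = c(51L, 48L, 49L),
                   stim2   = c(50L, 50L, 47L))

# wide to long: one row per (rowid, stimulus column)
long <- melt(wide,
             id.vars      = c("subject", "rowid"),
             measure.vars = c("stim1", "stim2"))
setorder(long, rowid)

# the k-th appearance of a value has been seen k - 1 times before
long[, seen := seq_len(.N) - 1L, by = .(subject, value)]

long[]
```

In this sketch the value 50 gets seen = 0 on row 1 and seen = 1 on row 2, while every other value appears once and gets 0.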
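3. Chosen
I have been lazy and effectively done the same as in section (2), but first filtering to the rows where Chosen matches the stim value. The two columns are done one at a time rather than together, because they are independent. The process is the same for stim1 and stim2, so this could be tidied up a bit.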
# turn Chosen from wide to long
dt_chosen <- melt(dt,
id.vars = c("subject", "rowid"),
measure.vars = c("Chosen"))
# interested in finding occurrences
# need to expand
dt_chosen[, variable := NULL]
# going to expand the grid, so can look at e.g. value 50 for all rowids
library(tidyr)
dt_chosen[, chosen_row := 1]
dt_chosen_full <- expand(dt_chosen, nesting(subject, rowid), value) %>% setDT
# pull in the actual data and fill rest with 0's
dt_chosen_full <- merge(dt_chosen_full, dt_chosen, by=c("subject", "rowid", "value"),
all.x=TRUE)
dt_chosen_full[is.na(chosen_row), chosen_row := 0]
# use cumsum to identify now the cumulative count of these across the full row set
dt_chosen_full[, chosen := cumsum(chosen_row), by=.(subject, value)]
# as its prior, on the row itself, subtract one so the update happens after the row
dt_chosen_full[chosen_row==1, chosen := chosen-1]
# merge across twice
dt <- merge(dt, dt_chosen_full[, -"chosen_row"],
by.x=c("subject", "rowid", "stim1"),
by.y=c("subject", "rowid", "value"),
all.x=TRUE, sort=FALSE)
setnames(dt, "chosen", "stim1_chosen")
dt[is.na(stim1_chosen), stim1_chosen := 0]
dt <- merge(dt, dt_chosen_full[, -"chosen_row"],
by.x=c("subject", "rowid", "stim2"),
by.y=c("subject", "rowid", "value"),
all.x=TRUE, sort=FALSE)
setnames(dt, "chosen", "stim2_chosen")
dt[is.na(stim2_chosen), stim2_chosen := 0]
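The "cumsum, then subtract one on the row itself" step can be illustrated on a plain vector. This is a base R sketch with a made-up indicator vector (the values and the names running and prior are assumptions for illustration):

```r
# hypothetical indicator: 1 where a given value was chosen on that row
chosen_row <- c(1, 1, 0, 0, 1, 0)

# running total, including the current row
running <- cumsum(chosen_row)

# count strictly before each row: remove the row's own contribution
prior <- running - chosen_row

prior
# [1] 0 1 2 2 2 3
```

Subtracting chosen_row has the same effect as subtracting 1 only on the rows where chosen_row is 1, which is what the data.table code above does.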
Output
dt[]
subject rowid stim2 stim1 Chosen stim1_seen stim2_seen stim1_chosen stim2_chosen
1: 1021 1 50 51 50 0 0 0 0
2: 1021 2 50 48 50 0 1 0 1
3: 1021 3 47 49 49 0 0 0 0
4: 1021 4 46 48 48 1 0 0 0
5: 1021 5 51 49 49 1 1 1 0
6: 1021 6 47 46 46 1 1 0 0
Here is a single pipe, demonstrated on both frames.
dat1 is where you showed some expected output:
dat1[, c("stim1_seen", "stim2_seen") :=
lapply(.SD, function(z) mapply(function(x, S) {
sum(stim1[S] %in% x | stim2[S] %in% x)
}, z, lapply(seq_len(.N)-1, seq_len))),
.SDcols = c("stim1", "stim2"), by = .(subject)
][, c("stim1_chosen", "stim2_chosen") :=
lapply(.SD, function(z) mapply(function(x, S) {
sum(Chosen[S] %in% x)
}, z, lapply(seq_len(.N)-1, seq_len))),
.SDcols = c("stim1", "stim2"), by = .(subject)]
# subject stim1 stim2 Chosen stim1_seen stim2_seen stim1_chosen stim2_chosen
# <int> <int> <int> <int> <int> <int> <int> <int>
# 1: 1 2 1 2 0 0 0 0
# 2: 1 3 2 2 0 1 0 1
# 3: 1 3 1 1 1 1 0 0
# 4: 1 2 3 3 2 2 2 0
# 5: 1 1 3 1 2 3 1 1
# 6: 1 2 1 1 3 3 2 2
dat2 is your dput output (different data):
dat2[, c("stim1_seen", "stim2_seen") :=
lapply(.SD, function(z) mapply(function(x, S) {
sum(stim1[S] %in% x | stim2[S] %in% x)
}, z, lapply(seq_len(.N)-1, seq_len))),
.SDcols = c("stim1", "stim2"), by = .(subject)
][, c("stim1_chosen", "stim2_chosen") :=
lapply(.SD, function(z) mapply(function(x, S) {
sum(Chosen[S] %in% x)
}, z, lapply(seq_len(.N)-1, seq_len))),
.SDcols = c("stim1", "stim2"), by = .(subject)]
# subject stim1 stim2 Chosen stim1_seen stim2_seen stim1_chosen stim2_chosen
# <num> <int> <int> <int> <int> <int> <int> <int>
# 1: 1021 51 50 50 0 0 0 0
# 2: 1021 48 50 50 0 1 0 1
# 3: 1021 49 47 49 0 0 0 0
# 4: 1021 48 46 48 1 0 0 0
# 5: 1021 49 51 49 1 1 1 0
# 6: 1021 46 47 46 1 1 0 0
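The heavy lifting here is wanting to do a "cumulative %in%". In effect, that is what the mapply is doing.
Knowing that data.table's .N special symbol provides the number of rows in a group, this is useful: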
lapply(seq_len(.N)-1, seq_len)
# [[1]]
# integer(0)
# [[2]]
# [1] 1
# [[3]]
# [1] 1 2
# [[4]]
# [1] 1 2 3
# [[5]]
# [1] 1 2 3 4
# [[6]]
# [1] 1 2 3 4 5
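- This is used to index all rows before each row; that is, on row 1 there are no preceding rows, so we index on integer(0); on row 5, we index 1 2 3 4; and so on.
- We "zip" this together (using mapply) with each stim1 (and then stim2) value, to look for its presence in the original stim1 and stim2 columns indexed on S (from the previous bullet), and sum the occurrences.
- Finally, we do this for both stim* columns by iterating over .SD (using .SDcols).
- This process is repeated (albeit more simply) on the Chosen column.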
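The same "cumulative %in%" can be sketched in base R without data.table, which may make the mechanics easier to follow. The vectors below are the dat1 values; the helper names prior_idx and count_prior are made up for this illustration:

```r
# values from dat1
stim1  <- c(2L, 3L, 3L, 2L, 1L, 2L)
stim2  <- c(1L, 2L, 1L, 3L, 3L, 1L)
Chosen <- c(2L, 2L, 1L, 3L, 1L, 1L)

# prior_idx[[i]] holds the indices of all rows before row i
prior_idx <- lapply(seq_along(stim1) - 1L, seq_len)

# for each row, count the earlier rows in which the value x was chosen
count_prior <- function(z) {
  mapply(function(x, S) sum(Chosen[S] %in% x), z, prior_idx)
}

stim1_chosen <- count_prior(stim1)
stim2_chosen <- count_prior(stim2)

stim1_chosen
# [1] 0 0 0 2 1 2
stim2_chosen
# [1] 0 1 0 0 1 2
```

These match the stim1_chosen and stim2_chosen columns shown for dat1 above. Note that each row scans all earlier rows, so the work grows quadratically with the number of rows per subject.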
Data
dat1 <- setDT(structure(list(subject = c(1L, 1L, 1L, 1L, 1L, 1L), stim1 = c(2L, 3L, 3L, 2L, 1L, 2L), stim2 = c(1L, 2L, 1L, 3L, 3L, 1L), Chosen = c(2L, 2L, 1L, 3L, 1L, 1L)), class = c("data.table", "data.frame"), row.names = c(NA, -6L)))
dat2 <- setDT(structure(list(subject = c(1021, 1021, 1021, 1021, 1021, 1021), stim1 = c(51L, 48L, 49L, 48L, 49L, 46L), stim2 = c(50L, 50L, 47L, 46L, 51L, 47L), Chosen = c(50L, 50L, 49L, 48L, 49L, 46L)), row.names = c(NA, -6L), class = c("data.table", "data.frame")))