R data.table with variable number of columns
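For each student in a data set, a certain set of scores may have been collected. We want to calculate the mean for each student, but using only the scores in the columns that are relevant to that student.
The columns required in the calculation differ from row to row. I've figured out how to write this in R using the usual tools, but I'm trying to rewrite it with data.table, partly for fun and partly in anticipation that success in this small project might lead to the need to make calculations for a great many rows.
Here is a small working example of the "choose a specific column set for each row" problem.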
set.seed(123234)
## Suppose these are 10 students in various grades
## rep() has no `by` argument; `times = 2` states the intended repetition explicitly
dat <- data.frame(id = 1:10, grade = rep(3:7, times = 2),
                  A = sample(c(1:5, 9), 10, replace = TRUE),
                  B = sample(c(1:5, 9), 10, replace = TRUE),
                  C = sample(c(1:5, 9), 10, replace = TRUE),
                  D = sample(c(1:5, 9), 10, replace = TRUE))
## 9 is a marker for missing value, there might also be
## NAs in real data, and those are supposed to be regarded
## differently in some exercises
## Students in various grades are administered different
## tests. A data structure gives the grade to test linkage.
## The letters are column names in dat
lookup <- list("3" = c("A", "B"),
               "4" = c("A", "C"),
               "5" = c("B", "C", "D"),
               "6" = c("A", "B", "C", "D"),
               "7" = c("C", "D"),
               "8" = c("C"))
## wrapper around that lookup because I kept getting confused
getLookup <- function(grade){
    lookup[[as.character(grade)]]
}
## Function that receives one row (named vector)
## from data frame and chooses columns and makes calculation
getMean <- function(arow, lookup){
    scores <- arow[getLookup(arow["grade"])]
    mean(scores[scores != 9], na.rm = TRUE)
}
stuscores <- apply(dat, 1, function(x) getMean(x, lookup))
result <- data.frame(dat, stuscores)
result
## If the data is 1000s of thousands of rows,
## I will wish I could use data.table to do that.
## Client will want students sorted by state, district, classroom,
## etc.
## However, am stumped on how to specify the adjustable
## column-name chooser
library(data.table)
DT <- data.table(dat)
## How to write call to getMean correctly?
## Want to do this for each participant (no grouping)
setkey(DT, id)
The desired output is each student's mean over the relevant columns, like this:
> result
   id grade A B C D stuscores
1   1     3 9 9 1 4       NaN
2   2     4 5 4 1 5       3.0
3   3     5 1 3 5 9       4.0
4   4     6 5 2 4 5       4.0
5   5     7 9 1 1 3       2.0
6   6     3 3 3 4 3       3.0
7   7     4 9 2 9 2       NaN
8   8     5 3 9 2 9       2.0
9   9     6 2 3 2 5       3.0
10 10     7 3 2 4 1       2.5
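What next? I've written a lot of errors so far...
I haven't found any examples in the data.table documentation in which the columns used in each row's calculation are themselves a variable. Thank you for your advice.
I am not asking anyone to write the code for me; I'm asking for advice on how to get started with this problem.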
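First of all, when creating a reproducible example with functions such as sample (which draw new random values on every run), you should use set.seed.
Second, instead of looping over each row, loop over the lookup list, which will always be smaller than the data (often many times smaller), and combine it with rowMeans. You can also do this in base R, but you asked for a data.table solution, so here goes (for the purposes of this solution I've converted all 9s to NAs, but you can try to generalize this to your specific case too).
So, using set.seed(123), your function gives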
apply(dat, 1, function(x) getMean(x, lookup))
# [1] 2.000000 5.000000 4.666667 4.500000 2.500000 1.000000 4.000000 2.333333 2.500000 1.500000
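The base R route mentioned above is not spelled out in the answer. A minimal sketch of it might look like the following, assuming a small hand-made data set (the values below are hypothetical, not the question's sampled data) and assuming that 9 can simply be treated as missing in the score columns:

```r
# Hypothetical toy data (not the question's sampled values); 9 marks a missing score.
dat <- data.frame(id = 1:6,
                  grade = c(3, 4, 5, 3, 4, 5),
                  A = c(1, 5, 2, 9, 3, 4),
                  B = c(3, 2, 9, 9, 1, 2),
                  C = c(4, 1, 5, 2, 9, 3),
                  D = c(2, 2, 4, 1, 5, 9))
lookup <- list("3" = c("A", "B"), "4" = c("A", "C"), "5" = c("B", "C", "D"))

# Recode the sentinel in the score columns only, so ids and grades are untouched.
score_cols <- c("A", "B", "C", "D")
scores <- as.matrix(dat[score_cols])
scores[scores == 9] <- NA

# Loop over the (short) lookup list rather than over the rows.
dat$res <- NA_real_
for (g in names(lookup)) {
  rows <- which(dat$grade == as.numeric(g))
  if (length(rows) > 0) {
    # drop = FALSE keeps a matrix even when only one row or column is selected
    dat$res[rows] <- rowMeans(scores[rows, lookup[[g]], drop = FALSE], na.rm = TRUE)
  }
}
dat
```

A student whose relevant scores are all missing (id 4 here) gets NaN, which matches the behaviour of the row-wise getMean in the question.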
Here's a possible data.table application which loops only over the lookup list (for loops over lists are very efficient in R, by the way; see here)
## convert the 9 codes to NA in the score columns only
## (applying this to the whole data set would also turn id 9 into NA)
score_cols <- c("A", "B", "C", "D")
is.na(dat[score_cols]) <- dat[score_cols] == 9L
## convert your original data to `data.table`,
## there is no need for an additional copy of the data if the data is huge
setDT(dat)
## loop only over the list
for(i in names(lookup)) {
    ## names(lookup) are character, grade is integer, so convert before comparing
    dat[grade == as.integer(i), res := rowMeans(as.matrix(.SD[, lookup[[i]], with = FALSE]), na.rm = TRUE)]
}
dat
#     id grade  A  B  C  D      res
#  1:  1     3  2 NA NA NA 2.000000
#  2:  2     4  5  3  5 NA 5.000000
#  3:  3     5  3  5  4  5 4.666667
#  4:  4     6 NA  4 NA  5 4.500000
#  5:  5     7 NA  1  4  1 2.500000
#  6:  6     3  1 NA  5  3 1.000000
#  7:  7     4  4  2  4  5 4.000000
#  8:  8     5 NA  1  4  2 2.333333
#  9:  9     6  4  2  2  2 2.500000
# 10: 10     7  3 NA  1  2 1.500000
Possibly this could be improved using set, but I can't currently think of a good way.
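One way the set idea might be approached is sketched below. This is an assumption about what such a variant could look like, not the answerer's method: the toy data is hypothetical, the 9 codes are taken to be recoded to NA already, and the result column is created up front so that set only has to fill in rows.

```r
library(data.table)

# Hypothetical toy data with the 9 codes already recoded to NA.
dat <- data.table(id = 1:4,
                  grade = c(3L, 3L, 7L, 7L),
                  A = c(2, NA, 5, 1),
                  B = c(4, 3, 1, 2),
                  C = c(1, 5, NA, 4),
                  D = c(3, 2, 2, NA))
lookup <- list("3" = c("A", "B"), "7" = c("C", "D"), "8" = "C")

dat[, res := NA_real_]                 # create the target column once, by reference
for (g in names(lookup)) {
  rows <- which(dat$grade == as.integer(g))
  if (length(rows) == 0L) next         # e.g. grade 8 has no students in this toy data
  vals <- rowMeans(as.matrix(dat[rows, lookup[[g]], with = FALSE]), na.rm = TRUE)
  set(dat, i = rows, j = "res", value = vals)   # assign by reference, no [.data.table overhead
}
dat
```

Whether this is measurably faster than the := loop would depend on the number of grades and rows; with only a handful of lookup entries the difference is probably small.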
P.S.
As suggested by @Arun, please take a look at the vignettes he wrote himself here in order to get familiar with the := operator, .SD, with = FALSE, etc.
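For readers who have not met these pieces yet, here is a tiny illustration of each. The table and the column name ac_mean are hypothetical, and .SDcols is shown as a common companion of .SD even though the answer above selects columns with with = FALSE instead.

```r
library(data.table)

# Hypothetical three-row table, for illustration only.
dt <- data.table(id = 1:3, A = c(1, 2, 3), B = c(4, 5, 6), C = c(7, 8, 9))
cols <- c("A", "C")

# with = FALSE: treat `cols` as a character vector of column names to select.
sub <- dt[, cols, with = FALSE]

# .SD is the subset of the data available in j; .SDcols limits it to `cols`.
# := adds the new column by reference, without copying dt.
dt[, ac_mean := rowMeans(.SD), .SDcols = cols]
dt
```

Here sub holds only columns A and C, and ac_mean is the row-wise mean of those two columns.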
Here's another data.table approach, using melt.data.table (needs data.table 1.9.5+) and then joins between the data.tables:
## long format: one row per (student, test), keyed for the join
DT_m <- setkey(melt.data.table(DT, c("id", "grade"), value.name = "score"), grade, variable)
## one row per (grade, test) pair that should count
lookup_dt <- data.table(grade = rep(as.integer(names(lookup)), lengths(lookup)),
                        variable = unlist(lookup), key = "grade,variable")
## keep only the relevant scores, then average per student
score_summary <- setkey(DT_m[lookup_dt, nomatch = 0L,
                             .(res = mean(score[score != 9], na.rm = TRUE)), by = id], id)
## join the means back onto the wide table; the column is named to match the output below
setkey(DT, id)[score_summary, mean_score := i.res]
DT
#     id grade A B C D mean_score
#  1:  1     3 9 9 1 4        NaN
#  2:  2     4 5 4 1 5        3.0
#  3:  3     5 1 3 5 9        4.0
#  4:  4     6 5 2 4 5        4.0
#  5:  5     7 9 1 1 3        2.0
#  6:  6     3 3 3 4 3        3.0
#  7:  7     4 9 2 9 2        NaN
#  8:  8     5 3 9 2 9        2.0
#  9:  9     6 2 3 2 5        3.0
# 10: 10     7 3 2 4 1        2.5
It's more verbose, but more than twice as fast:
microbenchmark(da_method(), nk_method(), times = 1000)
# Unit: milliseconds
#         expr       min        lq      mean    median        uq       max neval
#  da_method() 17.465893 17.845689 19.249615 18.079206 18.337346 181.76369  1000
#  nk_method()  7.047405  7.282276  7.757005  7.489351  7.667614  20.30658  1000
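The melt-and-join idea can also be written without setting keys, using the on = argument available in more recent data.table versions. The sketch below is a variation on the approach above rather than the answerer's exact code; the toy data is hypothetical, and melt() is assumed to dispatch to the data.table method.

```r
library(data.table)

# Hypothetical toy data; 9 marks a missing score, as in the question.
DT <- data.table(id = 1:4,
                 grade = c(3L, 4L, 3L, 4L),
                 A = c(1, 5, 9, 3),
                 B = c(3, 2, 9, 1),
                 C = c(4, 1, 2, 9))
lookup <- list("3" = c("A", "B"), "4" = c("A", "C"))

# One row per (grade, test) pair that should count.
lookup_dt <- data.table(grade = rep(as.integer(names(lookup)), lengths(lookup)),
                        variable = unlist(lookup, use.names = FALSE))

# Long format: one row per (student, test); keep `variable` as character.
DT_m <- melt(DT, id.vars = c("id", "grade"),
             variable.name = "variable", value.name = "score",
             variable.factor = FALSE)

# Inner join keeps only the relevant scores; then average per student,
# dropping the 9 codes (and any real NAs) before taking the mean.
score_summary <- DT_m[lookup_dt, on = c("grade", "variable"), nomatch = 0L][
  , .(res = mean(score[score != 9], na.rm = TRUE)), by = id]

# Update join: write the means back onto the wide table by reference.
DT[score_summary, on = "id", res := i.res]
DT
```

Because the lookup lives in a table rather than in a loop, adding a grade or a test only means adding rows to lookup_dt.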