Use dplyr to summarize the first element of a nested list (2-d array?)
I am trying to understand the appropriate way to use dplyr to summarize a nested list inside a tibble.
The structure is as follows:
> glimpse(mydata)
Rows: 1,000
Columns: 3
$ meta <df[,6]> <data.frame[40 x 6]>
$ independent_variable <list> [<"A", "B", "B", "B", "A", "A", "B", "A…
$ dependent_variables <df[,4]> <data.frame[40 x 4]>
> head(mydata$independent_variable)
[[1]]
[,1] [,2] [,3] [,4]
[1,] "A" "FALSE" "5" NA
[2,] "B" "FALSE" "5" "NA"
[3,] "B" "FALSE" "5" "NA"
[4,] "B" "FALSE" "5" "NA"
[5,] "A" "FALSE" "13" "NA"
[6,] "A" "FALSE" "5" "NA"
[7,] "B" "FALSE" "12" "NA"
[8,] "A" "FALSE" "133" "NA"
[9,] "A" "FALSE" "131" "NA"
[10,] "A" "TRUE" "0" "NA"
[[2]]
[,1] [,2] [,3] [,4]
[1,] "A" "FALSE" "77" NA
[2,] "B" "FALSE" NA "NA"
[3,] "B" "FALSE" NA "NA"
[4,] "B" "FALSE" NA "NA"
[5,] "B" "FALSE" NA "NA"
[6,] "A" "TRUE" "1" "NA"
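independent_variable is a list of 1,000 entries, each an N x 4 matrix (that is, all 1,000 entries have 4 columns and a varying number of rows). The first column is the only one I currently care about, and each of its elements can only be "A" or "B". I want to count the number of "A" values in each of the 1,000 entries and get that count back for every entry.
It seems I should be using purrr, but I am not sure how to structure it within dplyr.
Here is an approach using purrr: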
library(purrr)
library(dplyr)
# my example data
tmp = list(cbind(c("A","A","B"),1),cbind(c("B","A","B"),2))
# define a summary function
count_A = function(x){
  x %>%
    as.data.frame() %>% # needed because the input is a matrix, not a data frame
    select(V1) %>% # V1 is the default name given to column 1
    filter(V1 == "A") %>%
    ungroup() %>% # not required, but makes clear the whole data frame is summarised
    summarise(num_A = n())
}
# test summary function
count_A(tmp[[1]])
# apply function to every element of list
map(tmp, count_A)
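In this pattern, the summary function can be any function that takes a single argument and returns the desired result. If the function works correctly when applied to the first element of the list (as in the code above, where I test the summary function), you can expect map to apply it in the same way to every element of the list.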
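To get the counts back as a column of the tibble, which is what the question asks for, one option is to call purrr's map_int inside dplyr's mutate. The following is a minimal sketch, assuming each list entry is a character matrix like the ones printed in the question. The names mydata_small and id are hypothetical stand-ins for the real data, and counting with sum() on the first column is a shortcut used here in place of count_A.

```r
library(dplyr)
library(purrr)
library(tibble)

# Hypothetical stand-in for `mydata`: a tibble whose list column holds
# character matrices with a varying number of rows.
mydata_small <- tibble(
  id = 1:2,
  independent_variable = list(
    cbind(c("A", "A", "B"), "1"),
    cbind(c("B", "A", "B"), "2")
  )
)

# For each matrix, compare column 1 to "A" and count the matches.
# map_int() returns one integer per list entry, so num_A is a plain column.
result <- mydata_small %>%
  mutate(num_A = map_int(independent_variable,
                         ~ sum(.x[, 1] == "A", na.rm = TRUE)))

result$num_A
```

Because map_int returns an integer vector rather than a list, num_A ends up as an ordinary integer column. Using map with count_A instead would give a list column of one-row data frames.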