Pipe output of more than one variable using tidyr::map or dplyr

Question

I am building on my previous question on SO.

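I want to create six correlation matrices that let me analyze how the correlations of dollars spent and of quantities sold have evolved over the last three years. Essentially, I am looking for a 2 X [3X3] type of list. So far, I can create a 3X3 list with purrr::map() by making a separate call for each of Product_Type and Quantity, but I have not managed to do it in one vectorized call. As you will see below, there is a lot of redundancy in my code.

Here's my data: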

dput(DFile_Gather)
structure(list(Order.ID = c(456, 567, 345, 567, 2345, 8910, 8910, 
789, 678, 456, 345, 8910, 234, 1234, 456), Calendar.Year = c(2015, 
2015, 2016, 2015, 2017, 2015, 2015, 2016, 2015, 2015, 2016, 2015, 
2016, 2016, 2015), Product_Type = c("Insurance", "Insurance", 
"Tire", "Tire", "Rental", "Insurance", "Servicing", "Truck", 
"Tire", "Servicing", "Truck", "Rental", "Car", "Servicing", "Tire"
), Mexican_Pesos = c(35797.32, 1916.25, 19898.62, 0, 22548.314011, 
686.88, 0, 0, 0, 0, 0, 0, 0, 0, 203276.65683), Quantity = c(0.845580721440663, 
0.246177053792905, 2.10266268677851, 1.89588258358317, 0.00223077008050406, 
0.454640961140588, 1.92032156606277, 0.475872861771994, 0.587966920885798, 
0.721024745664671, 0.696609684682582, 0.0441522564791413, 0.872232778060772, 
0.343347997825813, 0.716224049425646)), .Names = c("Order.ID", 
"Calendar.Year", "Product_Type", "Mexican_Pesos", "Quantity"), row.names = c(54L, 
55L, 13L, 15L, 50L, 58L, 28L, 37L, 16L, 24L, 33L, 48L, 2L, 29L, 
14L), class = "data.frame")

Here's my code for the first iteration, i.e. calculating the correlation matrices for Product_Type:

library(tidyr)  # spread()
library(dplyr)  # provides the %>% pipe used below
DFile_Spread_PType <- spread(DFile_Gather[-length(DFile_Gather)],key = Product_Type, value = Mexican_Pesos)

DFile<-DFile_Spread_PType
CYear <- unique(DFile$Calendar.Year)
DFile_Corr_PType <- purrr::map(CYear, ~ dplyr::filter(DFile, Calendar.Year == .)) %>% 
  purrr::map(~ cor(.[,colnames(DFile)[3:length(colnames(DFile))]]) ) %>%
  structure(., names = CYear)

Finally, here's my code for the second iteration, i.e. the correlation matrices by Quantity:

DFile_Spread_Qty <- spread(subset( DFile_Gather, select = -Mexican_Pesos),key = Product_Type, value = Quantity)
DFile<-DFile_Spread_Qty
DFile_Corr_Qty <- purrr::map(CYear, ~ dplyr::filter(DFile, Calendar.Year == .)) %>% 
  purrr::map(~ cor(.[,colnames(DFile)[3:length(colnames(DFile))]]) ) %>%
  structure(., names = CYear)

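As you can see above, there is too much redundancy and the code looks clunky. I'd appreciate it if someone could help me. I am specifically looking for two things:

1) Accomplish what I did above without any redundancy

2) If possible, get a list of 2X3X3, i.e. Quantity and Product_Type at the top level, and then the 3x3 correlation matrices for each of them.

I searched SO for similar topics, but I don't think there is a thread that discusses this.

Thanks in advance.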

Answer 1

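The following has no redundancy and uses no packages. Make Product_Type a factor and then split by year, giving the by-year list s. Now use a double Map over Values and s, converting to wide form in each inner iteration using tapply and then running cor.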

DG <- transform(DFile_Gather, Product_Type = factor(Product_Type))
s <- split(DG, DG$Calendar.Year)
Values <- c("Mexican_Pesos", "Quantity")
By <- c("Order.ID", "Product_Type")
res <- Map(function(v) Map(function(s) cor(tapply(s[, v], s[By], c)), s), Values)
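
To make the shape of the result easier to see, here is a minimal sketch of the same double Map on a small made-up data frame. The `toy` data and its values are hypothetical, not the asker's; only the column names are borrowed from the question. It shows what the tapply step produces for a single year and how to pull one correlation matrix out of the nested list.

```r
# Hypothetical toy data: 6 orders, 2 years, 2 product types (values are made up).
toy <- data.frame(
  Order.ID      = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6),
  Calendar.Year = rep(c(2015, 2016), each = 6),
  Product_Type  = factor(rep(c("Car", "Tire"), 6)),
  Mexican_Pesos = c(10, 20, 30, 25, 50, 70, 5, 9, 8, 3, 12, 15),
  Quantity      = c(1, 2, 2, 1, 3, 5, 4, 1, 2, 2, 6, 3)
)

s      <- split(toy, toy$Calendar.Year)      # one data frame per year
Values <- c("Mexican_Pesos", "Quantity")     # response variables
By     <- c("Order.ID", "Product_Type")      # rows and columns of the wide form

# One year, one response: tapply() pivots long -> wide (orders x product types).
wide <- tapply(s[["2015"]][, "Quantity"], s[["2015"]][By], c)

# Outer Map over the response variables, inner Map over the years.
res <- Map(function(v) Map(function(d) cor(tapply(d[, v], d[By], c)), s), Values)

# Top level is named by response variable, second level by year.
res$Quantity[["2015"]]
```

Because Values is a character vector, Map names the top level after it, so the result is indexed first by variable and then by year, which is the nesting asked for in point 2 of the question.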

Answer 2

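To get the correlations between Product_Type for each combination of response variable and year, you can reshape the dataset into a convenient format, split it into a list by the combination of factors, and get the correlations via map, with dplyr::select choosing the columns. This doesn't return a list of lists, though.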

library(purrr)
library(tidyr)

DFile_Gather %>%
    gather(type, value, Mexican_Pesos:Quantity) %>%
    spread(Product_Type, value) %>%
    split(list(.$Calendar.Year, .$type)) %>%
    map(~cor(dplyr::select(.x, Car:Truck)))

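A list of lists takes an extra step, since I first have to split by the response variable and then, within each element of that list, split by Calendar.Year. I then use at_depth instead of map to calculate the correlations between Product_Type for each list within the list. The 2 in at_depth indicates that the work happens at the lowest level.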

DFile_Gather %>%
    gather(type, value, Mexican_Pesos:Quantity) %>%
    spread(Product_Type, value) %>%
    split(.$type) %>%
    map(~split(.x, .x$Calendar.Year)) %>%
    at_depth(2, ~cor(dplyr::select(.x, Car:Truck)))
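
As far as I know, at_depth() was later deprecated in purrr and may not be available in the version you have installed. A minimal sketch of the same idea with two nested map() calls is below. It assumes purrr, tidyr and dplyr are installed, and it runs on a small made-up data frame (the `toy` data is hypothetical; only the column names come from the question), so the column range is Car:Tire rather than Car:Truck.

```r
library(purrr)
library(tidyr)

# Hypothetical toy data: 6 orders, 2 years, 2 product types (values are made up).
toy <- data.frame(
  Order.ID      = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6),
  Calendar.Year = rep(c(2015, 2016), each = 6),
  Product_Type  = rep(c("Car", "Tire"), 6),
  Mexican_Pesos = c(10, 20, 30, 25, 50, 70, 5, 9, 8, 3, 12, 15),
  Quantity      = c(1, 2, 2, 1, 3, 5, 4, 1, 2, 2, 6, 3),
  stringsAsFactors = FALSE
)

res <- toy %>%
    gather(type, value, Mexican_Pesos:Quantity) %>%   # stack the two response columns
    spread(Product_Type, value) %>%                   # one column per product type
    split(.$type) %>%                                 # level 1: response variable
    map(~split(.x, .x$Calendar.Year)) %>%             # level 2: year
    # Nested map() in place of at_depth(2, ...): the outer call walks the
    # response variables, the inner call walks the years.
    map(function(by_type) {
        map(by_type, function(d) cor(dplyr::select(d, Car:Tire)))
    })

res$Quantity[["2016"]]
```

Named functions are used for the two levels instead of nested `~` formulas only to keep it obvious which `.x` belongs to which level.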

The first few rows/columns of the interim dataset after gathering and spreading look like this:

   Order.ID Calendar.Year          type       Car    Insurance       Rental
1       234          2016 Mexican_Pesos 0.0000000           NA           NA
2       234          2016      Quantity 0.8722328           NA           NA
3       345          2016 Mexican_Pesos        NA           NA           NA
4       345          2016      Quantity        NA           NA           NA
5       456          2015 Mexican_Pesos        NA 3.579732e+04           NA
6       456          2015      Quantity        NA 8.455807e-01           NA
...


r

dplyr

tidyr