使用 map() 和 mutate() 遍历多个变量和参数
Looping across multiple variables and parameters using map() and mutate()
我无法弄清楚如何有效映射 tbl 中的多个参数和变量以生成新变量。
在 "real" 版本中,我基本上有一个数学函数生成中心估计,我需要 运行 一系列不同参数的敏感性测试。我试图弄清楚如何在 tidyverse 中执行此操作。看起来 map() 和 mutate() 是这个问题的答案,但我遇到了麻烦。
# building the practice dataset
pracdf <- tibble(ID = letters,
p = runif(26, 100, 1000),
med.a = runif(26),
med.b = runif(26),
c = runif(26))
pracdf <- pracdf %>%
mutate(low.a = med.a * 0.8,
low.b = med.b * 0.8,
high.a = med.a * 1.2,
high.b = med.b * 1.2)
# this generates a few low/med/high values for variables
# the function
pracdf <- pracdf %>% mutate(d = p * med.a * med.b * c)
# works as expected. Now can I loop it with dynamic variable names?
f1 <- function(df, var.a) {
var.a <- enquo(var.a)
print(var.a)
d.name <- paste0("d.", quo_name(var.a))
print(d.name)
df %>% mutate(!!d.name := p * (!!var.a) * c)
}
pracdf2 <- f1(pracdf, med.a)
# works great! Eventually I want to loop through low, med, high. Start with a loop of 1
pracdf3 <- map(list(med.a), f1, df = pracdf)
# loop crashes spectacularly
pracdf3 <- map(list(med.a), ~f1, df = pracdf)
# failure
pracdf3 <- map(med.a, ~f1, df = pracdf)
# what am I doing with my life
我认为使这项任务变得困难的问题之一是当前的设置可能不是很 "tidy"。例如。 low.a
、low.b
、med.a
等似乎是我所理解的 'untidy' 列的示例。
下面是一种根本不使用 for 循环或自定义函数的可能方法(我相当确定它可能会得到改进)。关键思想是采用初始 pracdf
并扩展现有行,因此每个 "level"(即低、中和高)各占一行。这样做可以让我们在一个步骤中计算 d
,而没有针对低、中和高的 for 循环。
(为便于阅读而编辑并包含 Jens Leerssen's 建议)
library(dplyr)
library(tidyr)
set.seed(123)
pracdf <- tibble(ID = letters,
p = runif(26, 100, 1000),
a = runif(26),
b = runif(26),
c = runif(26))
levdf <- tibble(level = c("low", "med", "high"),
level_val = c(0.8, 1.0, 1.2))
tidy_df <- pracdf %>% merge(levdf) %>%
mutate(d = p * (level_val * a) * (level_val * b) * c) %>%
select(-level_val) %>% arrange(ID) %>% as_tibble()
tidy_df
#> # A tibble: 78 x 7
#> ID p a b c level d
#> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 a 358.8198 0.5440660 0.7989248 0.3517979 low 35.116168
#> 2 a 358.8198 0.5440660 0.7989248 0.3517979 med 54.869013
#> 3 a 358.8198 0.5440660 0.7989248 0.3517979 high 79.011379
#> 4 b 809.4746 0.5941420 0.1218993 0.1111354 low 4.169914
#> 5 b 809.4746 0.5941420 0.1218993 0.1111354 med 6.515490
#> 6 b 809.4746 0.5941420 0.1218993 0.1111354 high 9.382306
#> 7 c 468.0792 0.2891597 0.5609480 0.2436195 low 11.837821
#> 8 c 468.0792 0.2891597 0.5609480 0.2436195 med 18.496595
#> 9 c 468.0792 0.2891597 0.5609480 0.2436195 high 26.635096
#> 10 d 894.7157 0.1471136 0.2065314 0.6680556 low 11.622957
#> # ... with 68 more rows
但是,上面的结果可能不是您想要的最终数据格式。但是我们可以通过使用 tidyr::gather
和 [= 对 tidy_df
进行一些收集和传播来解决这个问题19=]。
tidy_df %>%
gather(variable, value, a, b, d) %>%
unite(level_variable, level, variable) %>%
spread(level_variable, value)
#> # A tibble: 26 x 12
#> ID p c high_a high_b high_d low_a
#> * <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 a 358.8198 0.3517979 0.54406602 0.79892485 79.011379 0.54406602
#> 2 b 809.4746 0.1111354 0.59414202 0.12189926 9.382306 0.59414202
#> 3 c 468.0792 0.2436195 0.28915974 0.56094798 26.635096 0.28915974
#> 4 d 894.7157 0.6680556 0.14711365 0.20653139 26.151654 0.14711365
#> 5 e 946.4206 0.4176468 0.96302423 0.12753165 69.905442 0.96302423
#> 6 f 141.0008 0.7881958 0.90229905 0.75330786 108.778072 0.90229905
#> 7 g 575.2949 0.1028646 0.69070528 0.89504536 52.681362 0.69070528
#> 8 h 903.1771 0.4348927 0.79546742 0.37446278 168.480110 0.79546742
#> 9 i 596.2915 0.9849570 0.02461368 0.66511519 13.845603 0.02461368
#> 10 j 510.9533 0.8930511 0.47779597 0.09484066 29.775361 0.47779597
#> # ... with 16 more rows, and 5 more variables: low_b <dbl>, low_d <dbl>,
#> # med_a <dbl>, med_b <dbl>, med_d <dbl>
考虑一种矢量化方法(请原谅我的非 tidyverse 数据争论),其中可以在一次调用中处理所有新列。在随机数据之前使用 seed(888)
来重现输出:
f1 <- function(df, vars) {
df[paste0("d.", vars)] <- df$p * df[vars] * df$c
return(df)
}
newpracdf <- f1(pracdf, c("low.a","high.a","med.a","med.b","low.b","high.b"))
输出
# # A tibble: 26 x 15
# ID p med.a med.b c low.a low.b high.a high.b d.low.a d.high.a d.med.a d.med.b d.low.b d.high.b
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 a 122.9573 0.65746601 0.43123587 0.81314570 0.52597281 0.3449887 0.7889592 0.5174830 52.587917 78.881876 65.734897 43.115909 34.492727 51.739091
# 2 b 412.0127 0.19793909 0.77148952 0.26039116 0.15835127 0.6171916 0.2375269 0.9257874 16.988630 25.482945 21.235787 82.768834 66.215068 99.322601
# 3 c 155.1248 0.30834064 0.99850558 0.57853922 0.24667251 0.7988045 0.3700088 1.1982067 22.137823 33.206735 27.672279 89.611689 71.689351 107.534027
# 4 d 715.3769 0.85517040 0.81715464 0.84196723 0.68413632 0.6537237 1.0262045 0.9805856 412.071636 618.107455 515.089546 492.191742 393.753393 590.630090
# 5 e 790.5284 0.12617255 0.59290522 0.54879020 0.10093804 0.4743242 0.1514071 0.7114863 43.790379 65.685568 54.737973 257.222588 205.778070 308.667105
# 6 f 193.6968 0.15173488 0.93054996 0.08587380 0.12138791 0.7444400 0.1820819 1.1166600 2.019104 3.028655 2.523879 15.478286 12.382629 18.573943
# 7 g 451.6000 0.88123996 0.62858787 0.12546384 0.70499197 0.5028703 1.0574880 0.7543054 39.944473 59.916709 49.930591 35.615457 28.492365 42.738548
# 8 h 342.3741 0.09952918 0.56932309 0.10980862 0.07962334 0.4554585 0.1194350 0.6831877 2.993489 4.490234 3.741861 21.404056 17.123245 25.684867
# 9 i 143.9489 0.42407685 0.94929822 0.02754267 0.33926148 0.7594386 0.5088922 1.1391579 1.345083 2.017624 1.681353 3.763718 3.010975 4.516462
# 10 j 911.8069 0.25822441 0.08934875 0.55244369 0.20657953 0.0714790 0.3098693 0.1072185 104.058645 156.087967 130.073306 45.006930 36.005544 54.008316
# # ... with 16 more rows
我无法弄清楚如何有效映射 tbl 中的多个参数和变量以生成新变量。
在 "real" 版本中,我基本上有一个数学函数生成中心估计,我需要 运行 一系列不同参数的敏感性测试。我试图弄清楚如何在 tidyverse 中执行此操作。看起来 map() 和 mutate() 是这个问题的答案,但我遇到了麻烦。
# building the practice dataset
pracdf <- tibble(ID = letters,
p = runif(26, 100, 1000),
med.a = runif(26),
med.b = runif(26),
c = runif(26))
pracdf <- pracdf %>%
mutate(low.a = med.a * 0.8,
low.b = med.b * 0.8,
high.a = med.a * 1.2,
high.b = med.b * 1.2)
# this generates a few low/med/high values for variables
# the function
pracdf <- pracdf %>% mutate(d = p * med.a * med.b * c)
# works as expected. Now can I loop it with dynamic variable names?
f1 <- function(df, var.a) {
var.a <- enquo(var.a)
print(var.a)
d.name <- paste0("d.", quo_name(var.a))
print(d.name)
df %>% mutate(!!d.name := p * (!!var.a) * c)
}
pracdf2 <- f1(pracdf, med.a)
# works great! Eventually I want to loop through low, med, high. Start with a loop of 1
pracdf3 <- map(list(med.a), f1, df = pracdf)
# loop crashes spectacularly
pracdf3 <- map(list(med.a), ~f1, df = pracdf)
# failure
pracdf3 <- map(med.a, ~f1, df = pracdf)
# what am I doing with my life
我认为使这项任务变得困难的问题之一是当前的设置可能不是很 "tidy"。例如。 low.a
、low.b
、med.a
等似乎是我所理解的 'untidy' 列的示例。
下面是一种根本不使用 for 循环或自定义函数的可能方法(我相当确定它可能会得到改进)。关键思想是采用初始 pracdf
并扩展现有行,因此每个 "level"(即低、中和高)各占一行。这样做可以让我们在一个步骤中计算 d
,而没有针对低、中和高的 for 循环。
(为便于阅读而编辑并包含 Jens Leerssen's 建议)
library(dplyr)
library(tidyr)
set.seed(123)
pracdf <- tibble(ID = letters,
p = runif(26, 100, 1000),
a = runif(26),
b = runif(26),
c = runif(26))
levdf <- tibble(level = c("low", "med", "high"),
level_val = c(0.8, 1.0, 1.2))
tidy_df <- pracdf %>% merge(levdf) %>%
mutate(d = p * (level_val * a) * (level_val * b) * c) %>%
select(-level_val) %>% arrange(ID) %>% as_tibble()
tidy_df
#> # A tibble: 78 x 7
#> ID p a b c level d
#> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 a 358.8198 0.5440660 0.7989248 0.3517979 low 35.116168
#> 2 a 358.8198 0.5440660 0.7989248 0.3517979 med 54.869013
#> 3 a 358.8198 0.5440660 0.7989248 0.3517979 high 79.011379
#> 4 b 809.4746 0.5941420 0.1218993 0.1111354 low 4.169914
#> 5 b 809.4746 0.5941420 0.1218993 0.1111354 med 6.515490
#> 6 b 809.4746 0.5941420 0.1218993 0.1111354 high 9.382306
#> 7 c 468.0792 0.2891597 0.5609480 0.2436195 low 11.837821
#> 8 c 468.0792 0.2891597 0.5609480 0.2436195 med 18.496595
#> 9 c 468.0792 0.2891597 0.5609480 0.2436195 high 26.635096
#> 10 d 894.7157 0.1471136 0.2065314 0.6680556 low 11.622957
#> # ... with 68 more rows
但是,上面的结果可能不是您想要的最终数据格式。但是我们可以通过使用 tidyr::gather
和 [= 对 tidy_df
进行一些收集和传播来解决这个问题19=]。
tidy_df %>%
gather(variable, value, a, b, d) %>%
unite(level_variable, level, variable) %>%
spread(level_variable, value)
#> # A tibble: 26 x 12
#> ID p c high_a high_b high_d low_a
#> * <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 a 358.8198 0.3517979 0.54406602 0.79892485 79.011379 0.54406602
#> 2 b 809.4746 0.1111354 0.59414202 0.12189926 9.382306 0.59414202
#> 3 c 468.0792 0.2436195 0.28915974 0.56094798 26.635096 0.28915974
#> 4 d 894.7157 0.6680556 0.14711365 0.20653139 26.151654 0.14711365
#> 5 e 946.4206 0.4176468 0.96302423 0.12753165 69.905442 0.96302423
#> 6 f 141.0008 0.7881958 0.90229905 0.75330786 108.778072 0.90229905
#> 7 g 575.2949 0.1028646 0.69070528 0.89504536 52.681362 0.69070528
#> 8 h 903.1771 0.4348927 0.79546742 0.37446278 168.480110 0.79546742
#> 9 i 596.2915 0.9849570 0.02461368 0.66511519 13.845603 0.02461368
#> 10 j 510.9533 0.8930511 0.47779597 0.09484066 29.775361 0.47779597
#> # ... with 16 more rows, and 5 more variables: low_b <dbl>, low_d <dbl>,
#> # med_a <dbl>, med_b <dbl>, med_d <dbl>
考虑一种矢量化方法(请原谅我的非 tidyverse 数据争论),其中可以在一次调用中处理所有新列。在随机数据之前使用 seed(888)
来重现输出:
f1 <- function(df, vars) {
df[paste0("d.", vars)] <- df$p * df[vars] * df$c
return(df)
}
newpracdf <- f1(pracdf, c("low.a","high.a","med.a","med.b","low.b","high.b"))
输出
# # A tibble: 26 x 15
# ID p med.a med.b c low.a low.b high.a high.b d.low.a d.high.a d.med.a d.med.b d.low.b d.high.b
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 a 122.9573 0.65746601 0.43123587 0.81314570 0.52597281 0.3449887 0.7889592 0.5174830 52.587917 78.881876 65.734897 43.115909 34.492727 51.739091
# 2 b 412.0127 0.19793909 0.77148952 0.26039116 0.15835127 0.6171916 0.2375269 0.9257874 16.988630 25.482945 21.235787 82.768834 66.215068 99.322601
# 3 c 155.1248 0.30834064 0.99850558 0.57853922 0.24667251 0.7988045 0.3700088 1.1982067 22.137823 33.206735 27.672279 89.611689 71.689351 107.534027
# 4 d 715.3769 0.85517040 0.81715464 0.84196723 0.68413632 0.6537237 1.0262045 0.9805856 412.071636 618.107455 515.089546 492.191742 393.753393 590.630090
# 5 e 790.5284 0.12617255 0.59290522 0.54879020 0.10093804 0.4743242 0.1514071 0.7114863 43.790379 65.685568 54.737973 257.222588 205.778070 308.667105
# 6 f 193.6968 0.15173488 0.93054996 0.08587380 0.12138791 0.7444400 0.1820819 1.1166600 2.019104 3.028655 2.523879 15.478286 12.382629 18.573943
# 7 g 451.6000 0.88123996 0.62858787 0.12546384 0.70499197 0.5028703 1.0574880 0.7543054 39.944473 59.916709 49.930591 35.615457 28.492365 42.738548
# 8 h 342.3741 0.09952918 0.56932309 0.10980862 0.07962334 0.4554585 0.1194350 0.6831877 2.993489 4.490234 3.741861 21.404056 17.123245 25.684867
# 9 i 143.9489 0.42407685 0.94929822 0.02754267 0.33926148 0.7594386 0.5088922 1.1391579 1.345083 2.017624 1.681353 3.763718 3.010975 4.516462
# 10 j 911.8069 0.25822441 0.08934875 0.55244369 0.20657953 0.0714790 0.3098693 0.1072185 104.058645 156.087967 130.073306 45.006930 36.005544 54.008316
# # ... with 16 more rows