Multiple condition if-else using dplyr, custom function, or purrr
I have a data frame with a structure similar to the following:
set.seed(123)
df <- tibble(SectionName = rep(letters[1:2], 50),
             TimeSpentSeconds = sample(0:360, 100, replace = TRUE),
             Correct = sample(0:1, 100, replace = TRUE))
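I want to summarise this data frame by binning the TimeSpentSeconds values into ranges (less than 30, 30-60, 60-90, ..., greater than 180), labelling each time with its range, grouping by SectionName, and summing the Correct column, so that the resulting data frame looks (something like) this: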
TimeGroup SectionName Correct
<fct> <chr> <int>
1 LessThan30Secs a 2
2 LessThan30Secs b 3
3 30-60 Seconds a 4
4 30-60 Seconds b 3
5 60-90 Seconds a 2
6 60-90 Seconds b 3
7 90-120 Seconds a 4
8 90-120 Seconds b 0
9 120-150 Seconds a 4
10 120-150 Seconds b 0
11 150-180 Seconds a 1
12 150-180 Seconds b 2
13 GreaterThan180Seconds a 11
14 GreaterThan180Seconds b 11
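I was able to do this successfully with the following if-else code, in which I mutate all the times into a new column with the appropriate label, then group and summarise: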
x <- c("LessThan30Secs", "30-60 Seconds", "60-90 Seconds", "90-120 Seconds",
       "120-150 Seconds", "150-180 Seconds", "GreaterThan180Seconds")
df %>%
  mutate(TimeGroup = if_else(TimeSpentSeconds >= 0 & TimeSpentSeconds <= 30, "LessThan30Secs",
                     if_else(TimeSpentSeconds > 30 & TimeSpentSeconds <= 60, "30-60 Seconds",
                     if_else(TimeSpentSeconds > 60 & TimeSpentSeconds <= 90, "60-90 Seconds",
                     if_else(TimeSpentSeconds > 90 & TimeSpentSeconds <= 120, "90-120 Seconds",
                     if_else(TimeSpentSeconds > 120 & TimeSpentSeconds <= 150, "120-150 Seconds",
                     if_else(TimeSpentSeconds > 150 & TimeSpentSeconds <= 180, "150-180 Seconds",
                     if_else(TimeSpentSeconds > 180, "GreaterThan180Seconds", "")))))))) %>%
  mutate(TimeGroup = factor(TimeGroup, levels = x)) %>%
  arrange(TimeGroup) %>%
  group_by(TimeGroup, SectionName) %>%
  summarise(Correct = sum(Correct))
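However, there has to be a better way to do this. I considered writing a function, but I didn't get far because I'm not good at writing functions.
Does anyone have ideas for a more elegant way to produce the same output, whether through a dplyr method I haven't thought of, a custom function (perhaps using the purrr package at some point), or some other R function?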
``` r
library(tidyverse)
set.seed(123)
df <- tibble(SectionName = rep(letters[1:2], 50),
             TimeSpentSeconds = sample(0:360, 100, replace = TRUE),
             Correct = sample(0:1, 100, replace = TRUE))

# Return the "<start>-<end> Seconds" label of the bin that `value` falls into
time_spent_range <- function(value, start, end, interval) {
  # round `end` up to a multiple of the interval so the last bin covers it
  end <- ceiling(end / interval) * interval
  bins_start <- seq(start, end - interval, by = interval)
  bins_end <- seq(start + interval, end, by = interval)
  bins_tibble <- tibble(bin_start = bins_start,
                        bin_end = bins_end) %>%
    # bins are (bin_start, bin_end], except that 0 belongs to the first bin
    mutate(in_bin = if_else((value > bin_start | (value == 0 & bin_start == 0))
                            & value <= bin_end,
                            1,
                            0)) %>%
    filter(in_bin == 1)
  bin <- paste0(as.character(bins_tibble$bin_start[1]),
                '-',
                as.character(bins_tibble$bin_end[1]),
                ' Seconds')
  return(bin)
}

df %>%
  mutate(TimeGroup = map_chr(TimeSpentSeconds, time_spent_range, start = 0,
                             end = max(df$TimeSpentSeconds), interval = 30))
#> # A tibble: 100 x 4
#> SectionName TimeSpentSeconds Correct TimeGroup
#> <chr> <int> <int> <chr>
#> 1 a 103 1 90-120 Seconds
#> 2 b 284 0 270-300 Seconds
#> 3 a 147 0 120-150 Seconds
#> 4 b 318 1 300-330 Seconds
#> 5 a 339 0 330-360 Seconds
#> 6 b 16 1 0-30 Seconds
#> 7 a 190 1 180-210 Seconds
#> 8 b 322 1 300-330 Seconds
#> 9 a 199 0 180-210 Seconds
#> 10 b 164 0 150-180 Seconds
#> # ... with 90 more rows
```
Created on 2018-08-26 by the reprex package (v0.2.0).
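The function above is called once per row through map_chr. As a possible alternative, the same kind of label can be computed for a whole vector at once with plain arithmetic. The following is a minimal base R sketch, not part of the answer: `label_bin` is a hypothetical helper name, and it assumes non-negative times and bins that start at 0.

```r
# Hypothetical helper: label each value with its (start, end] bin of width `interval`.
# Assumes value >= 0 and that the first bin starts at 0.
label_bin <- function(value, interval = 30) {
  # ceiling() maps 1..30 to bin 1, 31..60 to bin 2, and so on;
  # pmax() places a value of exactly 0 in the first bin as well
  bin_index <- pmax(ceiling(value / interval), 1)
  bin_end <- bin_index * interval
  paste0(bin_end - interval, "-", bin_end, " Seconds")
}

labels <- label_bin(c(0, 16, 30, 103, 284, 339))
print(labels)
```

Because it is vectorised, a helper like this could be used directly inside mutate() without map_chr.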
We can do this easily with cut (or findInterval) instead of multiple nested ifelse statements:
lbls <- c('LessThan30Secs', '30-60 Seconds', '60-90 Seconds',
          '90-120 Seconds', '120-150 Seconds', '150-180 Seconds', 'GreaterThan180Seconds')
df %>%
  # include.lowest = TRUE puts a time of exactly 0 in the first bin instead of NA
  group_by(TimeGroup = cut(TimeSpentSeconds,
                           breaks = c(seq(0, 180, by = 30), Inf),
                           labels = lbls, include.lowest = TRUE),
           SectionName) %>%
  summarise(Correct = sum(Correct)) %>%
  na.omit()
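The findInterval alternative is mentioned but not shown, so here is a minimal base R sketch of it. It is not from the answer: the names `breaks`, `times`, `idx`, and `time_group` are illustrative, and it runs on a small hand-made vector rather than the question's data. findInterval returns the index of the interval each value falls into, and that index is then used to look up the label. Its intervals are closed on the left by default, so left.open = TRUE is passed to get the (lower, upper] convention that cut uses by default.

```r
# Bin edges: six 30-second bins, plus an open-ended bin above 180
breaks <- seq(0, 180, by = 30)
lbls <- c("LessThan30Secs", "30-60 Seconds", "60-90 Seconds",
          "90-120 Seconds", "120-150 Seconds", "150-180 Seconds",
          "GreaterThan180Seconds")

# Illustrative values, including the boundaries 0, 30 and 180
times <- c(0, 30, 31, 90, 180, 181, 360)

# left.open = TRUE gives (lower, upper] intervals; combined with
# rightmost.closed = TRUE the lowest edge (0) is included in the first bin
idx <- findInterval(times, breaks, left.open = TRUE, rightmost.closed = TRUE)

# Values below the first edge get index 0; turn that into NA so the lookup
# below keeps one label per input value
idx[idx == 0] <- NA

time_group <- factor(lbls[idx], levels = lbls)
print(data.frame(times, time_group))
```

The resulting factor could be used in group_by() in the same way as the output of cut.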
case_when() does exactly what you want. It is a tidy alternative to nested ifelse() statements.
library(dplyr)
mutate(df,
       TimeGroup = case_when(
         TimeSpentSeconds <= 30 ~ "30 Seconds or less",
         TimeSpentSeconds <= 60 ~ "31-60 Seconds",
         TimeSpentSeconds <= 90 ~ "61-90 Seconds",
         TimeSpentSeconds <= 120 ~ "91-120 Seconds",
         TimeSpentSeconds <= 150 ~ "121-150 Seconds",
         TimeSpentSeconds <= 180 ~ "151-180 Seconds",
         TimeSpentSeconds > 180 ~ "Greater Than 180 Seconds",
         TRUE ~ NA_character_)
)
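The last argument is a catch-all for records that meet none of the conditions, such as a missing (NA) time. Note that a time below 0 seconds would not reach it as written, because it already satisfies the first condition (<= 30).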
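Because case_when() checks its conditions in order and uses the first one that matches, the position of each condition matters. The following is a minimal sketch on a hypothetical toy tibble (`toy`, `labelled`, and the collapsed "31-180 Seconds" label are illustrative, not from the answer). It places an explicit guard for negative times first and lets the catch-all pick up missing values.

```r
library(dplyr)

# Hypothetical toy input: a negative time, a boundary value, a large value, and NA
toy <- tibble(TimeSpentSeconds = c(-5, 30, 200, NA))

labelled <- toy %>%
  mutate(TimeGroup = case_when(
    TimeSpentSeconds < 0    ~ NA_character_,              # guard, checked before "<= 30"
    TimeSpentSeconds <= 30  ~ "30 Seconds or less",
    TimeSpentSeconds <= 180 ~ "31-180 Seconds",           # bins collapsed for brevity
    TimeSpentSeconds > 180  ~ "Greater Than 180 Seconds",
    TRUE                    ~ NA_character_               # catch-all, reached by NA input
  ))

print(labelled)
```

Without the first line, the value -5 would be labelled "30 Seconds or less", since it satisfies that condition.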