R: count sum of partial string matches over multiple columns
I'm working with a messy summer camp registration form. The form output looks like this:
          leaders         teen_adventure
1 camp, overnight                   <NA>
2            <NA>                   <NA>
3 camp, overnight camp, float, overnight
I'd like to generate new columns that sum up each possible answer across the columns:
          leaders         teen_adventure camps overnights floats
1 camp, overnight                   <NA>     1          1      0
2            <NA>                   <NA>     0          0      0
3 camp, overnight camp, float, overnight     2          2      1
My gut tells me there is a dplyr solution, something like:
reprex %>%
mutate(camps = sum(case_when(
str_detect(select(., everything()), "camp") ~ 1,
TRUE ~ 0
)))
Or perhaps using across().
Here is the sample dataset:
# data
reprex <- structure(list(leaders = c("camp, overnight", NA, "camp, overnight"),
teen_adventure = c(NA, NA, "camp, float, overnight")),
row.names = c(NA, -3L), class = "data.frame")
One way:
library(dplyr)
library(stringr)
library(tidyr)

reprex %>%
  # replace NA with a placeholder so str_detect() returns FALSE instead of NA
  replace_na(list(leaders = 'unknown', teen_adventure = 'unknown')) %>%
  # add up the matches from both columns for each answer
  mutate(camp = as.numeric(str_detect(leaders, 'camp') + str_detect(teen_adventure, 'camp')),
         float = as.numeric(str_detect(leaders, 'float') + str_detect(teen_adventure, 'float')),
         overnight = as.numeric(str_detect(leaders, 'overnight') + str_detect(teen_adventure, 'overnight')))
Output:
          leaders         teen_adventure camp float overnight
1 camp, overnight                unknown    1     0         1
2         unknown                unknown    0     0         0
3 camp, overnight camp, float, overnight    2     1         2
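The question's hunch about across() can also be made to work. The following is only a sketch, assuming the answers live in exactly the two columns named in `answer_cols` (a name introduced here for illustration) and that a plain substring match is good enough. It keeps the NA values in place and lets rowSums() skip them:

```r
library(dplyr)
library(stringr)

# sample data from the question
reprex <- data.frame(
  leaders = c("camp, overnight", NA, "camp, overnight"),
  teen_adventure = c(NA, NA, "camp, float, overnight")
)

# assumption: these are the only columns that hold answers
answer_cols <- c("leaders", "teen_adventure")

result <- reprex %>%
  mutate(
    # across() returns one logical column per answer column;
    # rowSums() counts the TRUEs in each row and ignores NA
    camp      = rowSums(across(all_of(answer_cols), ~ str_detect(.x, "camp")), na.rm = TRUE),
    float     = rowSums(across(all_of(answer_cols), ~ str_detect(.x, "float")), na.rm = TRUE),
    overnight = rowSums(across(all_of(answer_cols), ~ str_detect(.x, "overnight")), na.rm = TRUE)
  )

print(result)
```

Naming the columns explicitly instead of using everything() matters here: columns created earlier in the same mutate() are visible to later across() calls, so everything() would also pick up the new count columns. Note that str_detect() matches substrings, so "camp" would also count a value such as "campfire".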
This solution works for any number of columns and values:
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)

reprex %>%
  as_tibble %>%
  # split the values by `, `
  mutate_all(strsplit, ", ") %>%
  # map through each column, then each cell, to turn it into a named vector
  # for example the first cell: c("camp", "overnight") => c("camp"=1, "overnight"=1)
  # then pivot it longer by the row number (this makes summing the values quick)
  map_dfr( function(x) x %>% map_dfr( ~ set_names(rep(1, length(.x<-.x[!is.na(.x)])), .x)) %>%
             mutate(id = row_number()) %>%
             pivot_longer(!id) ) %>%
  # group by id and name so the same values found in the same row are grouped together
  group_by(id, name) %>%
  # get the sum
  summarise_all(sum, na.rm = TRUE) %>%
  ungroup %>%
  # return the tibble to wide format
  pivot_wider %>%
  # remove the id column
  select(-id) %>%
  # add the original data.frame to it
  tibble(reprex, .)
# A tibble: 3 x 5
  leaders         teen_adventure         camp float overnight
  <chr>           <chr>                 <dbl> <dbl>     <dbl>
1 camp, overnight NA                        1     0         1
2 NA              NA                        0     0         0
3 camp, overnight camp, float, overnight    2     1         2
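The same "any number of columns and values" idea can be sketched more compactly by reshaping to one row per individual answer and counting. This is only an illustration, not the answerer's method; the names `id`, `question`, `answer` and `counts` are introduced here, and the separator is assumed to be a comma followed by optional whitespace:

```r
library(dplyr)
library(tidyr)

# sample data from the question
reprex <- data.frame(
  leaders = c("camp, overnight", NA, "camp, overnight"),
  teen_adventure = c(NA, NA, "camp, float, overnight")
)

counts <- reprex %>%
  # remember which original row each value came from
  mutate(id = row_number()) %>%
  # one row per (row, column) cell
  pivot_longer(-id, names_to = "question", values_to = "answer") %>%
  # one row per individual answer inside a cell
  separate_rows(answer, sep = ",\\s*") %>%
  filter(!is.na(answer)) %>%
  # number of times each answer occurs in each original row
  count(id, answer) %>%
  # restore rows that had no answers at all and fill missing combinations with 0
  complete(id = seq_len(nrow(reprex)), answer, fill = list(n = 0L)) %>%
  pivot_wider(names_from = answer, values_from = n) %>%
  arrange(id)

# attach the counts to the original data
result <- bind_cols(reprex, select(counts, -id))

print(result)
```

The complete() step is what keeps the all-NA second row in the result; without it that row would disappear after the filter() and the final bind_cols() would fail on mismatched row counts.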
A base R option:
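# all distinct answers found anywhere in the data
v <- unique(unlist(strsplit(na.omit(unlist(reprex)), ",\\s+")))
reprex <- cbind(
  reprex,
  do.call(
    rbind,
    lapply(
      seq_len(nrow(reprex)),
      # tabulate the answers of row k against the full set of levels
      function(k) table(factor(unlist(strsplit(na.omit(unlist(reprex[k, ])), ",\\s+")), levels = v))
    )
  )
)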
which gives
          leaders         teen_adventure camp overnight float
1 camp, overnight                   <NA>    1         1     0
2            <NA>                   <NA>    0         0     0
3 camp, overnight camp, float, overnight    2         2     1
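We can loop over the columns (map) to extract the words with str_extract_all, then use mtabulate to get the frequency counts, bind the list elements, and summarise to get the sum of the numeric columns.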
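library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
library(qdapTools)   # mtabulate()
library(data.table)  # rowid()

reprex %>%
  # extract the words from every cell and tabulate them, column by column
  map_dfr(~ str_extract_all(.x, "\\w+") %>%
            mtabulate, .id = 'grp') %>%
  # rowid() numbers the rows within each source column, which recovers the original row
  group_by(grp = rowid(grp)) %>%
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE)),
            .groups = 'drop') %>%
  select(-grp) %>%
  bind_cols(reprex, .)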
Output:
#          leaders         teen_adventure camp overnight float
#1 camp, overnight                   <NA>    1         1     0
#2            <NA>                   <NA>    0         0     0
#3 camp, overnight camp, float, overnight    2         2     1