Multiple patterns for string matching using case_when
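I'm trying to use str_detect and case_when to recode strings based on multiple patterns, and to paste the recoded value for each occurrence into a new column. The Correct column is the output I'm trying to achieve.
This is similar to this question and this question. If it can't be done with case_when (which I think is limited to one pattern per row), is there a better way to do it that still uses the tidyverse?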
library(dplyr)
library(stringr)

Fruit <- c("Apples", "apples, maybe bananas", "Oranges", "grapes w apples", "pears")
Num <- c(1, 2, 3, 4, 5)
data <- data.frame(Num, Fruit)

# Attempt: one case_when over the whole string, pasted into a new column
df <- data %>%
  mutate(Incorrect = paste(case_when(
    str_detect(Fruit, regex("apples", ignore_case = TRUE)) ~ "good",
    str_detect(Fruit, regex("bananas", ignore_case = TRUE)) ~ "gross",
    str_detect(Fruit, regex("grapes | oranges", ignore_case = TRUE)) ~ "ok",
    str_detect(Fruit, regex("lemon", ignore_case = TRUE)) ~ "sour",
    TRUE ~ "other"
  ), sep = ","))
Num  Fruit                  Incorrect
1    Apples                 good
2    apples, maybe bananas  good
3    Oranges                other
4    grapes w apples        good
5    pears                  other

Num  Fruit                  Correct
1    Apples                 good
2    apples, maybe bananas  good,gross
3    Oranges                ok
4    grapes w apples        ok,good
5    pears                  other
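In case_when, once a condition is satisfied for a row, evaluation stops there and no further conditions are checked for that row. So in cases like this it is usually better to put each entry in its own row, which makes it easier to assign values, and then summarise them all back together. Here, however, the Fruit column has no clear separator: some fruits are separated by a comma (,), some by whitespace, and there are other words in between. To handle all such cases, we assign NA to the words that don't match and then drop them during the summarise step.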
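library(dplyr)
library(stringr)

data %>%
  # Split on commas or runs of whitespace; the backslash must be doubled in an R string
  tidyr::separate_rows(Fruit, sep = ",|\\s+") %>%
  mutate(Correct = case_when(
    str_detect(Fruit, regex("apples", ignore_case = TRUE)) ~ "good",
    str_detect(Fruit, regex("bananas", ignore_case = TRUE)) ~ "gross",
    # No spaces around |, otherwise the spaces become part of the pattern
    str_detect(Fruit, regex("grapes|oranges", ignore_case = TRUE)) ~ "ok",
    str_detect(Fruit, regex("lemon", ignore_case = TRUE)) ~ "sour",
    TRUE ~ NA_character_)) %>%
  group_by(Num) %>%
  # Drop the unmatched words and collapse the remaining labels per Num
  summarise(Correct = toString(na.omit(Correct))) %>%
  # Bring back the original, unsplit Fruit column
  left_join(data, by = "Num")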
#     Num Correct     Fruit
#   <dbl> <chr>       <fct>
#1      1 good        Apples
#2      2 good, gross apples, maybe bananas
#3      3 ok          Oranges
#4      4 ok, good    grapes w apples
#5      5 sour        Lemons
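The first-match behaviour of case_when described at the start of this answer can be checked in isolation. The following is a minimal sketch; the vector `x` and the name `first_match` are made up for illustration and are not part of the original question.

```r
library(dplyr)
library(stringr)

# Hypothetical two-element input, just for illustration
x <- c("apples, maybe bananas", "bananas")

# Each element takes the value of the FIRST condition that is TRUE,
# so the first string never reaches the "bananas" branch.
first_match <- case_when(
  str_detect(x, "apples")  ~ "good",
  str_detect(x, "bananas") ~ "gross",
  TRUE                     ~ "other"
)

print(first_match)
# "good" "gross"
```

This is why a single case_when over the whole string can only ever return one label per row, and why the rows are split before recoding.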
For the updated data, we can remove the extra words that occur and then do:
data %>%
  # Remove the filler words; note this pattern also deletes every "w" inside other words
  mutate(Fruit = gsub("maybe|w", "", Fruit)) %>%
  # Split on a comma followed by whitespace, or on whitespace alone
  tidyr::separate_rows(Fruit, sep = ",\\s+|\\s+") %>%
  mutate(Correct = case_when(
    str_detect(Fruit, regex("apples", ignore_case = TRUE)) ~ "good",
    str_detect(Fruit, regex("bananas", ignore_case = TRUE)) ~ "gross",
    str_detect(Fruit, regex("grapes|oranges", ignore_case = TRUE)) ~ "ok",
    str_detect(Fruit, regex("lemon", ignore_case = TRUE)) ~ "sour",
    TRUE ~ "other")) %>%
  group_by(Num) %>%
  summarise(Correct = toString(na.omit(Correct))) %>%
  left_join(data, by = "Num")
#     Num Correct     Fruit
#   <dbl> <chr>       <fct>
#1      1 good        Apples
#2      2 good, gross apples, maybe bananas
#3      3 ok          Oranges
#4      4 ok, good    grapes w apples
#5      5 other       pears
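As a possible alternative that avoids reshaping the data, each pattern can be tested independently against the whole string and the matching labels collapsed. This is only a sketch, not the approach used above: the lookup vector `labels` and the helper `recode_all` are hypothetical names introduced here. One difference to be aware of is that labels come out in the order of the patterns, not in the order the fruits appear in the string, so row 4 gives "good, ok" rather than "ok, good".

```r
library(dplyr)
library(stringr)

data <- data.frame(
  Num = 1:5,
  Fruit = c("Apples", "apples, maybe bananas", "Oranges", "grapes w apples", "pears"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: names are regex patterns, values are the labels
labels <- c("apples" = "good", "bananas" = "gross",
            "grapes|oranges" = "ok", "lemon" = "sour")

# Test one string against every pattern and collapse the labels that hit
recode_all <- function(x) {
  hits <- str_detect(x, regex(names(labels), ignore_case = TRUE))
  if (any(hits)) toString(labels[hits]) else "other"
}

result <- data %>%
  mutate(Correct = vapply(Fruit, recode_all, character(1), USE.NAMES = FALSE))

print(result)
```

Because every pattern is checked regardless of earlier matches, filler words such as "maybe" or "w" do not need to be removed first.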