R return first substring match

I'm trying to categorize strings in a column of an R data frame. Specifically, I'd like to do the following:

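  1. Loop through a list of strings
  2. For each string, check whether any entry in a data frame column matches it as a substring
  3. If so, return the category that corresponds to the first position at which a substring matches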

For example, say I have dataframe1:

search_string = c('dan likes cake', 'molly likes cupcake', 'flanders likes berries')

I want to search it against a data frame that contains lookups and categories:

lookup_df = 
lookups: cake, cupcake, berr
categories: dessert, small dessert, fruit

I want to loop through search_strings (which is a column in a data frame) and return the following:

'dan likes cake' --> dessert
'molly likes cupcake' --> small dessert
'flanders likes berries' --> fruit

Right now I'm doing this inefficiently with a nested loop.

for (row in seq_len(nrow(search_string_df))) {
   # search string for this row (assumes the column is named search_string)
   search_string <- search_string_df$search_string[row]

   for (row_x in seq_len(nrow(lookup_df))) {
      # find first substring match in lookups
      # create a new column in search_string_df with the associated category
   }
}


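This takes a very long time, and I'm sure there is a better way. I'm not fluent with 'apply' and similar functions. I'm most comfortable with dplyr / tidyverse syntax.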

Using the tidyverse:

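library(tidyverse)

# Combine all lookups into one regex alternation: "cake|cupcake|berr"
pat <- str_c(lookup_df$lookups, collapse = '|')

data.frame(search_string) %>%
  # pull out the lookup that matches first in each string
  mutate(lookups = str_extract(search_string, pat)) %>%
  left_join(lookup_df, by = "lookups")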

           search_string lookups    categories
1         dan likes cake    cake       dessert
2    molly likes cupcake cupcake small dessert
3 flanders likes berries    berr         fruit
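
A note on why this picks the right category: the regex engine returns the match that starts earliest in the string, so "cupcake" is found before the "cake" nested inside it. The short check below is only a sketch of that behavior. The string "homer likes donuts" is made up for illustration and is not part of the original data. It also assumes the lookups contain no regex metacharacters (such as `.` or `(`), which would need escaping before being pasted into a pattern.

```r
library(stringr)

# Same pattern as in the answer, built from the lookups in the question
pat <- str_c(c("cake", "cupcake", "berr"), collapse = "|")

# "cupcake" starts at an earlier position than the "cake" inside it,
# so it is the match that str_extract() returns
hit <- str_extract("molly likes cupcake", pat)

# Made-up string with no lookup in it: str_extract() returns NA,
# so the subsequent left_join() would leave categories as NA too
miss <- str_extract("homer likes donuts", pat)
```

If two lookups can start at the same position (for example "cup" and "cupcake"), the one listed first in the pattern wins, so it may be worth ordering the lookups from most to least specific.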

Data

lookup_df <- data.frame(
  lookups = c('cake', 'cupcake', 'berr'),
  categories = c('dessert', 'small dessert', 'fruit'))
search_string <- c("dan likes cake", "molly likes cupcake", "flanders likes berries")
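
For comparison, here is a base R sketch of the same idea that matches the lookups literally instead of as a regex, so special characters in the lookups need no escaping. It is not part of the original answer. The helper name `first_match_category` is hypothetical, and the code assumes that "first match" means the lookup that occurs at the earliest position in the string, with ties going to the lookup listed first.

```r
# Data as given above
lookup_df <- data.frame(
  lookups = c("cake", "cupcake", "berr"),
  categories = c("dessert", "small dessert", "fruit"),
  stringsAsFactors = FALSE
)
search_string <- c("dan likes cake", "molly likes cupcake", "flanders likes berries")

# Hypothetical helper: category of the lookup that occurs earliest in `s`,
# or NA when none of the lookups occurs in it
first_match_category <- function(s, lookups, categories) {
  # Start position of each lookup in s (literal match); -1 when absent
  pos <- vapply(lookups, function(l) regexpr(l, s, fixed = TRUE)[1], integer(1))
  if (all(pos < 0)) return(NA_character_)
  # Push absent lookups to the end so which.min() ignores them
  pos[pos < 0] <- .Machine$integer.max
  categories[which.min(pos)]
}

result <- vapply(search_string, first_match_category, character(1),
                 lookups = lookup_df$lookups,
                 categories = lookup_df$categories,
                 USE.NAMES = FALSE)
```

This still scans every lookup for every string, so it is unlikely to beat the single-pattern `str_extract()` approach on large inputs; its main appeal is the literal matching.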