Extracting only the first appearance from a list of patterns in R

Question

I have a list of country names and a data frame containing one column of text and one column of binary indicators.

MWE:

rm(list=ls())

library(countrycode)
country_list <- countrycode::codelist$country.name.en

Text <- c("This is","a test to", "find country", "names like Algeria", "Albania and Afghanistan","in the data","and return only the","first match in each","string, Algeria and Albania", "not Afghanistan")
df <- as.data.frame(Text)
df$ofInterest <- c(0,0,0,1,1,1,0,0,1,0)

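I want to return the first word in df$Text that matches any element of country_list (and only the first one). In other words, I am only interested in the first country name mentioned.

The operation should create a new column in df that holds, for each row, the matched country name, or NA if no match from country_list was found.

To speed things up, I would also like to restrict the search to the rows where df$ofInterest == 1.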

In other words, it should return the following:

Text                       ofInterest   Match
This is                     0           NA   
a test to                   0           NA
find country                0           NA
names like Algeria          1           Algeria
Albania and Afghanistan     1           Albania
in the data                 1           NA
and return only the         0           NA
first match in each         0           NA
string, Algeria and Albania 1           Algeria
not Afghanistan             0           NA

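My problem is that I do not know how to use a regular expression while also matching against a list of patterns. How can I do this in R?

This is as far as I got. "xxxxx" is presumably where the country_list should go.

This is probably a simple question, but I could not find a solution. Thanks for your help!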

df$Match <- ifelse(str_extract(df$Text, "(?<=^| )xxxxx.*?(?=$| )") %in% country_list, str_extract(df$Text, "(?<=^| )xxxxx.*?(?=$| )"), NA)

Answer 1

You can use

df$Match <- str_extract(df$Text, paste0("(?i)\\b(", paste(country_list, collapse="|"), ")\\b"))
df <- within(df, Match[ofInterest == '0'] <- NA)
# > df
#                           Text ofInterest   Match
# 1                      This is          0    <NA>
# 2                    a test to          0    <NA>
# 3                 find country          0    <NA>
# 4           names like Algeria          1 Algeria
# 5      Albania and Afghanistan          1 Albania
# 6                  in the data          1    <NA>
# 7          and return only the          0    <NA>
# 8          first match in each          0    <NA>
# 9  string, Algeria and Albania          1 Algeria
# 10             not Afghanistan          0    <NA>

Here, paste0("(?i)\\b(", paste(country_list, collapse="|"), ")\\b") builds a pattern made of the following parts:

(?i) - case-insensitive matching
\b - a word boundary
( - start of a capturing group:
- paste(country_list, collapse="|") produces a |-separated list of country names, e.g. Albania|Poland|France
) - end of the group
\b - a word boundary.
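To make these parts concrete, here is a minimal sketch that builds the same kind of pattern from a hypothetical three-name list (small_list and texts are made up for illustration and stand in for country_list and df$Text). It uses base R's regexpr and regmatches rather than stringr, purely so that it runs without extra packages.

```r
# Hypothetical short list standing in for country_list
small_list <- c("Albania", "Poland", "France")

# Join the names with | and wrap them in word boundaries; the backslashes
# are doubled because this is an R string literal
pattern <- paste0("(?i)\\b(", paste(small_list, collapse = "|"), ")\\b")
print(pattern)

# Hypothetical example strings
texts <- c("visit poland and France", "Francesca lives here")

# regexpr() reports only the first match per string (-1 when there is none)
m <- regexpr(pattern, texts, perl = TRUE)
first_match <- rep(NA_character_, length(texts))
first_match[m > 0] <- regmatches(texts, m)
print(first_match)
```

The second string yields NA because the word boundary prevents "France" from matching inside "Francesca". The first string yields the text as it was written ("poland"), not the spelling from the list, since (?i) only relaxes the comparison.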

df <- within(df, Match[ofInterest == '0'] <- NA) then resets Match to NA in all rows where the ofInterest column is 0.
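The question also asks to skip rows that are not of interest rather than blank them afterwards. The following is a sketch of one way to do that, not part of the original answer: it searches only the flagged rows and, as an extra precaution, escapes regex metacharacters in the names first. The country_list below is a hypothetical subset (the real list may contain names with dots or parentheses), and base R functions are used in place of stringr.

```r
# Toy data mirroring the question; country_list is a hypothetical subset
country_list <- c("Afghanistan", "Albania", "Algeria", "St. Lucia")
df <- data.frame(
  Text = c("names like Algeria", "Albania and Afghanistan",
           "in the data", "not Afghanistan"),
  ofInterest = c(1, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Escape common regex metacharacters (such as the dot in "St. Lucia")
# so that each name is matched literally
escaped <- gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", country_list)
pattern <- paste0("(?i)\\b(", paste(escaped, collapse = "|"), ")\\b")

# Search only the rows flagged as interesting; every other row stays NA
df$Match <- NA_character_
rows <- which(df$ofInterest == 1)
m <- regexpr(pattern, df$Text[rows], perl = TRUE)
df$Match[rows[m > 0]] <- regmatches(df$Text[rows], m)
print(df)
```

Whether this is noticeably faster than matching every row and resetting afterwards depends on how many rows have ofInterest equal to 0, so it is worth timing on the real data.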

Answer 2

Another possible solution, based on intersect with country_list: split each phrase into individual words and take the first element of the intersection:

library(tidyverse)
library(countrycode)

df %>% 
  rowwise %>% 
  mutate(Match = if_else(ofInterest == 1,
   intersect(unlist(str_split(Text,"\\s")), country_list)[1], NA_character_)) %>%
  ungroup

#> # A tibble: 10 × 3
#>    Text                        ofInterest Match  
#>    <chr>                            <dbl> <chr>  
#>  1 This is                              0 <NA>   
#>  2 a test to                            0 <NA>   
#>  3 find country                         0 <NA>   
#>  4 names like Algeria                   1 Algeria
#>  5 Albania and Afghanistan              1 Albania
#>  6 in the data                          1 <NA>   
#>  7 and return only the                  0 <NA>   
#>  8 first match in each                  0 <NA>   
#>  9 string, Algeria and Albania          1 Algeria
#> 10 not Afghanistan                      0 <NA>
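A short base R sketch of how the intersect step behaves may help when deciding whether this approach fits your data. The country_list and the example phrases below are hypothetical, and strsplit stands in for str_split. It illustrates that the result follows the word order of the text, and that exact token matching misses names with attached punctuation as well as multi-word names.

```r
# Hypothetical subset of country names, including a multi-word one
country_list <- c("Afghanistan", "Albania", "Algeria", "South Africa")

# intersect() keeps the order of its first argument, so [1] is the first
# country word in the text, not the first entry of country_list
words1 <- unlist(strsplit("Albania and Afghanistan", "\\s+"))
r1 <- intersect(words1, country_list)[1]
print(r1)

# "Algeria," carries a trailing comma, so it is not an exact token match
# and the later name is returned instead
words2 <- unlist(strsplit("Algeria, then Albania", "\\s+"))
r2 <- intersect(words2, country_list)[1]
print(r2)

# Multi-word names are split into separate tokens and never match
words3 <- unlist(strsplit("flights to South Africa", "\\s+"))
r3 <- intersect(words3, country_list)[1]
print(r3)
```

In the question's sample data neither limitation is triggered, which is why both answers produce the same table there.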

regex

r

stringr