How to run through list of keyword vectors and fuzzy match them to a different file (R)

I have two files: one with keywords (about 2,000 rows) and one with text (about 770,000 rows). The keyword file looks like this:

Event Name            Keyword
All-day tabby fest    tabby, all-day
All-day tabby fest    tabby, fest
Maine Coon Grooming   maine coon, groom    
Maine Coon Grooming   coon, groom

keywordFile <- tibble(EventName = c("All-day tabby fest", "All-day tabby fest", "Maine Coon Grooming","Maine Coon Grooming"), Keyword = c("tabby, all-day", "tabby, fest", "maine coon, groom", "coon, groom"))

The text file looks like this:

Description
Bring your tabby to the fest on Tuesday
All cats are welcome to the fest on Tuesday
Mainecoon grooming will happen at noon Wednesday
Maine coons will be pampered at noon on Wednesday

text <- tibble(Description = c("Bring your tabby to the fest on Tuesday","All cats are welcome to the fest on Tuesday","Mainecoon grooming will happen at noon Wednesday","Maine coons will be pampered at noon on Wednesday"))

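What I want is to go through the text file, look for fuzzy matches (a match must include every word in the "Keyword" column), and return a new column showing TRUE or FALSE. If it is TRUE, I want a third column showing the event name. So it would look like: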

Description                                          Match?   Event Name
Bring your tabby to the fest on Tuesday              TRUE     All-day tabby fest
All cats are welcome to the fest on Tuesday          FALSE
Mainecoon grooming will happen at noon Wednesday     TRUE     Maine Coon Grooming
Maine coons will be pampered at noon on Wednesday    FALSE

Thanks to Molx, I have this check for a single string:

str <- c("tabby", "all-day")
myStr <- "Bring your tabby to the fest on Tuesday"
all(sapply(str, grepl, myStr))

However, I got stuck when I tried to fuzzy match across the whole file. I tried something like this:

for (i in seq_along(text$Description)){
  for (j in seq_along(keywordFile$EventName)) {
    # below I am creating the TRUE/FALSE column
    text$TF[i] <- all(sapply(keywordFile$Keyword[j], grepl, 
                                                     text$Description[i]))
    if (isTRUE(text$TF))
      # below I am creating the EventName column
      text$EventName <- keywordFile$EventName
    }
}

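I don't think I have a problem converting the right things to vectors and strings. My keywordFile$Keyword column is a bunch of string vectors, and my text$Description column is a string. But I am struggling with how to loop through the two files properly. The error I get is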

Error in ... replacement has 13 rows, data has 1

Has anyone done something like this before?

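I'm not sure I understood your question, because I wouldn't call grepl() fuzzy matching. It catches a keyword whenever the keyword sits inside a longer word. So "cat" and "catastrophe" would count as a match, even though the words are very different.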
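To make the substring behaviour concrete, here is a minimal base R sketch. The example strings are made up for illustration, and the word-boundary pattern is just one possible way to restrict grepl() to whole words, not something the original answer uses.

```r
# grepl() reports a hit whenever the pattern occurs anywhere in the string,
# including inside a longer word.
substring_hit <- grepl("cat", "catastrophe")

# Wrapping the keyword in word boundaries (\\b) only accepts whole words.
whole_word_miss <- grepl("\\bcat\\b", "catastrophe")
whole_word_hit  <- grepl("\\bcat\\b", "my cat sleeps all day")

print(c(substring_hit, whole_word_miss, whole_word_hit))
# TRUE FALSE TRUE
```

Neither form tolerates typos or spelling variants, which is why a string-distance approach is used in the rest of this answer.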

I chose to write an answer in which you can control the distance between strings that still counts as a match:

Load the libraries:

library(tibble)
library(dplyr)
library(fuzzyjoin)
library(tidytext)
library(tidyr)

Create the dictionary and the data objects:

dict <- tibble(Event_Name = c(
  "All-day tabby fest",
  "All-day tabby fest",
  "Maine Coon Grooming",
  "Maine Coon Grooming"
), Keyword = c(
  "tabby, all-day",
  "tabby, fest",
  "maine coon, groom",
  "coon, groom"
)) %>% 
  mutate(Keyword = strsplit(Keyword, ", ")) %>% 
  unnest(Keyword)

string <- tibble(id = 1:4, Description = c(
  "Bring your tabby to the fest on Tuesday",
  "All cats are welcome to the fest on Tuesday",
  "Mainecoon grooming will happen at noon Wednesday",
  "Maine coons will be pampered at noon on Wednesday"
))
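As a rough illustration of what the strsplit() and unnest() steps do to the dictionary, the following base R sketch reshapes two of the keyword strings into one row per keyword. The names `parts`, `long`, and `set` are hypothetical and only serve this example; the answer itself relies on tidyr for the same reshaping.

```r
# Two comma-separated keyword strings, as in the dictionary above.
keyword <- c("tabby, all-day", "maine coon, groom")

# strsplit() returns a list with one character vector per input string.
parts <- strsplit(keyword, ", ")

# Spread the list to long format: one row per keyword, remembering
# which original keyword string (set) each keyword came from.
long <- data.frame(
  set     = rep(seq_along(parts), lengths(parts)),
  Keyword = unlist(parts),
  stringsAsFactors = FALSE
)
print(long)
```

Note that multi-word keywords such as "maine coon" stay intact, because the split is on the comma and not on whitespace.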

Apply the dictionary to the data:

string_annotated <- string %>% 
  unnest_tokens(output = "word", input = Description) %>%
  stringdist_left_join(y = dict, by = c("word" = "Keyword"), max_dist = 1) %>% 
  mutate(match = !is.na(Keyword))

> string_annotated
# A tibble: 34 x 5
      id word    Event_Name         Keyword match
   <int> <chr>   <chr>              <chr>   <lgl>
 1     1 bring   NA                 NA      FALSE
 2     1 your    NA                 NA      FALSE
 3     1 tabby   All-day tabby fest tabby   TRUE 
 4     1 tabby   All-day tabby fest tabby   TRUE 
 5     1 to      NA                 NA      FALSE
 6     1 the     NA                 NA      FALSE
 7     1 fest    All-day tabby fest fest    TRUE 
 8     1 on      NA                 NA      FALSE
 9     1 tuesday NA                 NA      FALSE
10     2 all     NA                 NA      FALSE
# ... with 24 more rows

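max_dist controls what still counts as a match. In this case, a string distance of 1 or less finds matches in every text, but I also tried it with strings that do not match.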
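To get a feel for what a distance of 1 lets through, this sketch computes edit distances for a few word/keyword pairs from the sample data. It uses base R's adist() (generalized Levenshtein distance) as a stand-in; the join above relies on the stringdist package, whose default metric may differ in other cases, so treat this as an approximation. The `pairs` data frame and its column names are made up for the example.

```r
# Word/keyword pairs taken from the sample descriptions and dictionary.
pairs <- data.frame(
  word    = c("tabby", "mainecoon", "coons", "noon", "grooming"),
  keyword = c("tabby", "maine coon", "coon", "coon", "groom"),
  stringsAsFactors = FALSE
)

# adist() returns a matrix; drop() reduces each 1 x 1 result to a number.
pairs$distance <- mapply(function(a, b) drop(adist(a, b)),
                         pairs$word, pairs$keyword, USE.NAMES = FALSE)

# A pair survives the join only if its distance is at most max_dist.
pairs$within_1 <- pairs$distance <= 1
print(pairs)
# distances: 0, 1, 1, 1, 3
```

Two things stand out: "noon" is within distance 1 of "coon", which appears to be why descriptions that mention "noon" pick up the keyword "coon", and "grooming" is 3 edits away from "groom", so it is not matched at max_dist = 1.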

If you want to bring this long format back to the original format:

string_annotated_col <- string_annotated %>% 
  group_by(id) %>% 
  summarise(Description = paste(word, collapse = " "),
            match = sum(match),
            keywords = toString(unique(na.omit(Keyword))),
            Event_Name = toString(unique(na.omit(Event_Name))))

> string_annotated_col
# A tibble: 4 x 5
     id Description                                       match keywords         Event_Name         
  <int> <chr>                                             <int> <chr>            <chr>              
1     1 bring your tabby tabby to the fest on tuesday         3 tabby, fest      All-day tabby fest 
2     2 all cats are welcome to the fest on tuesday           1 fest             All-day tabby fest 
3     3 mainecoon grooming will happen at noon wednesday      2 maine coon, coon Maine Coon Grooming
4     4 maine coons will be pampered at noon on wednesday     2 coon             Maine Coon Grooming

If you don't understand parts of the answer, feel free to ask. Some of it is explained here, except for the fuzzy matching part.

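Approximate matching can be done in R with the agrep() and agrepl() functions. They work with the option fixed = FALSE. These functions do not need any additional libraries.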
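A minimal sketch of how agrepl() might be applied to the question's data, assuming a match means that every keyword in a keyword row is found approximately somewhere in the description. The function `match_events()`, its arguments, and the choice of `max.distance = 1` are hypothetical and not part of the original answer. The sketch leaves `fixed` at its default so the keywords are treated as literal strings, and it keeps only the first matching event per description.

```r
# Sample data from the question, as plain data frames.
keywordFile <- data.frame(
  EventName = c("All-day tabby fest", "All-day tabby fest",
                "Maine Coon Grooming", "Maine Coon Grooming"),
  Keyword   = c("tabby, all-day", "tabby, fest",
                "maine coon, groom", "coon, groom"),
  stringsAsFactors = FALSE
)
text <- data.frame(
  Description = c("Bring your tabby to the fest on Tuesday",
                  "All cats are welcome to the fest on Tuesday",
                  "Mainecoon grooming will happen at noon Wednesday",
                  "Maine coons will be pampered at noon on Wednesday"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns the first event whose keywords ALL match
# approximately, or NA when no keyword row matches completely.
match_events <- function(descriptions, keywords, event_names, max_dist = 1) {
  keyword_sets <- strsplit(keywords, ",\\s*")
  event <- rep(NA_character_, length(descriptions))
  for (j in seq_along(keyword_sets)) {
    hit <- rep(TRUE, length(descriptions))
    for (word in keyword_sets[[j]]) {
      # agrepl() is vectorized over the descriptions, so the loops only
      # run over keyword rows and their words, not over every text row.
      hit <- hit & agrepl(word, descriptions,
                          max.distance = max_dist, ignore.case = TRUE)
    }
    fill <- hit & is.na(event)   # keep an earlier match if there is one
    event[fill] <- event_names[j]
  }
  event
}

text$EventName <- match_events(text$Description,
                               keywordFile$Keyword, keywordFile$EventName)
text$Match <- !is.na(text$EventName)
print(text[, c("Description", "Match", "EventName")])
```

On the four sample descriptions this should flag the first and third rows as matches, in line with the table the question asks for. Because agrepl() matches approximately against any part of the string, "groom" is found inside "grooming" here, which the word-by-word distance join does not do at max_dist = 1. With about 2,000 keyword rows and 770,000 descriptions this can still take a while, so it is worth timing on a subset first.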