How to run through a list of keyword vectors and fuzzy match them to a different file (R)
I have two files, one of keywords (about 2,000 rows) and one of text (about 770,000 rows). The keyword file looks like this:
Event Name             Keyword
All-day tabby fest     tabby, all-day
All-day tabby fest     tabby, fest
Maine Coon Grooming    maine coon, groom
Maine Coon Grooming    coon, groom
library(tibble)
keywordFile <- tibble(EventName = c("All-day tabby fest", "All-day tabby fest", "Maine Coon Grooming", "Maine Coon Grooming"), Keyword = c("tabby, all-day", "tabby, fest", "maine coon, groom", "coon, groom"))
The text file looks like this:
Description
Bring your tabby to the fest on Tuesday
All cats are welcome to the fest on Tuesday
Mainecoon grooming will happen at noon Wednesday
Maine coons will be pampered at noon on Wednesday
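text <- tibble(Description = c("Bring your tabby to the fest on Tuesday", "All cats are welcome to the fest on Tuesday", "Mainecoon grooming will happen at noon Wednesday", "Maine coons will be pampered at noon on Wednesday"))
What I want is to go through the text file and look for fuzzy matches (a match must include every word in the "Keyword" column) and return a new column that shows TRUE or FALSE. If it is TRUE, I want a third column that shows the event name. So it would look like: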
Description                                          Match?   Event Name
Bring your tabby to the fest on Tuesday              TRUE     All-day tabby fest
All cats are welcome to the fest on Tuesday          FALSE
Mainecoon grooming will happen at noon Wednesday     TRUE     Maine Coon Grooming
Maine coons will be pampered at noon on Wednesday    FALSE
Thanks to Molx, I can already check a single string:
str <- c("tabby", "all-day")
myStr <- "Bring your tabby to the fest on Tuesday"
all(sapply(str, grepl, myStr))
However, I get stuck when I try to fuzzy match across the whole file. I tried something like this:
for (i in seq_along(text$Description)) {
  for (j in seq_along(keywordFile$EventName)) {
    # below I am creating the TRUE/FALSE column
    text$TF[i] <- all(sapply(keywordFile$Keyword[j], grepl,
                             text$Description[i]))
    if (isTRUE(text$TF))
      # below I am creating the EventName column
      text$EventName <- keywordFile$EventName
  }
}
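I don't think my problem is with converting the right things to vectors and strings. My keywordFile$Keyword column is a bunch of string vectors, and my text$Description column is a string. But I am struggling with how to loop through the two files properly. The error I get is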
Error in ... replacement has 13 rows, data has 1
Has anyone done something like this before?
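I'm not sure I understood your question, since I would not call grepl() fuzzy matching. It catches a keyword whenever it appears inside a longer word. So "cat" and "catastrophe" would count as a match, even though I think these words are very different.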
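To make the substring point concrete, here is a small base R sketch. The sentence "my cat sleeps" and the variable names are made up for illustration; the word-boundary pattern is only one possible way to require a whole-word hit.

```r
# grepl() looks for the pattern anywhere in the string, so a short keyword
# also hits inside a longer, unrelated word.
substring_hit <- grepl("cat", "catastrophe")

# Wrapping the keyword in word boundaries (\\b) requires a whole word instead.
whole_word_hit <- grepl("\\bcat\\b", "catastrophe")
whole_word_ok  <- grepl("\\bcat\\b", "my cat sleeps")

print(c(substring_hit, whole_word_hit, whole_word_ok))
```

Neither variant tolerates typos or spelling variants, which is why the rest of this answer uses a string distance instead.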
I chose to write an answer in which you can control the distance between strings that still counts as a match:
Load the libraries:
library(tibble)
library(dplyr)
library(fuzzyjoin)
library(tidytext)
library(tidyr)
Create the dictionary and data objects:
dict <- tibble(Event_Name = c(
  "All-day tabby fest",
  "All-day tabby fest",
  "Maine Coon Grooming",
  "Maine Coon Grooming"
), Keyword = c(
  "tabby, all-day",
  "tabby, fest",
  "maine coon, groom",
  "coon, groom"
)) %>%
  mutate(Keyword = strsplit(Keyword, ", ")) %>%
  unnest(Keyword)
string <- tibble(id = 1:4, Description = c(
  "Bring your tabby to the fest on Tuesday",
  "All cats are welcome to the fest on Tuesday",
  "Mainecoon grooming will happen at noon Wednesday",
  "Maine coons will be pampered at noon on Wednesday"
))
Apply the dictionary to the data:
string_annotated <- string %>%
  unnest_tokens(output = "word", input = Description) %>%
  stringdist_left_join(y = dict, by = c("word" = "Keyword"), max_dist = 1) %>%
  mutate(match = !is.na(Keyword))
> string_annotated
# A tibble: 34 x 5
id word Event_Name Keyword match
<int> <chr> <chr> <chr> <lgl>
1 1 bring NA NA FALSE
2 1 your NA NA FALSE
3 1 tabby All-day tabby fest tabby TRUE
4 1 tabby All-day tabby fest tabby TRUE
5 1 to NA NA FALSE
6 1 the NA NA FALSE
7 1 fest All-day tabby fest fest TRUE
8 1 on NA NA FALSE
9 1 tuesday NA NA FALSE
10 2 all NA NA FALSE
# ... with 24 more rows
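max_dist controls what still counts as a match. In this case, a distance between the strings of 1 or smaller finds matches for all of the texts, but I also tried it with a string that has no match.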
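To get a feel for what a distance of 1 means, the sketch below computes edit distances for a few of the sample words with base R's adist(). This is an illustration under an assumption: adist() gives the generalized Levenshtein distance, which may not be the exact metric the join uses by default, although the two agree on these simple cases.

```r
# Tokens from the sample descriptions and keywords from the dictionary
words    <- c("coons", "mainecoon", "noon", "grooming")
keywords <- c("coon", "maine coon", "groom")

# Matrix of edit distances: one row per word, one column per keyword
distances <- adist(words, keywords)
dimnames(distances) <- list(words, keywords)
print(distances)

# Pairs that would pass a threshold of 1
within_one <- distances <= 1
print(within_one)
```

Note that "noon" is one substitution away from "coon", so a threshold of 1 also lets that pair through, while "grooming" is three edits away from "groom" and is not matched. Tuning the threshold is a trade-off between these two kinds of error.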
If you want to bring this long format back to the original format:
string_annotated_col <- string_annotated %>%
  group_by(id) %>%
  summarise(Description = paste(word, collapse = " "),
            match = sum(match),
            keywords = toString(unique(na.omit(Keyword))),
            Event_Name = toString(unique(na.omit(Event_Name))))
> string_annotated_col
# A tibble: 4 x 5
id Description match keywords Event_Name
<int> <chr> <int> <chr> <chr>
1 1 bring your tabby tabby to the fest on tuesday 3 tabby, fest All-day tabby fest
2 2 all cats are welcome to the fest on tuesday 1 fest All-day tabby fest
3 3 mainecoon grooming will happen at noon wednesday 2 maine coon, coon Maine Coon Grooming
4 4 maine coons will be pampered at noon on wednesday 2 coon Maine Coon Grooming
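The summary above counts keyword hits per description, but the question asks for TRUE only when every keyword in a keyword row is found. A minimal base R sketch of that rule follows. It swaps in agrepl() (approximate substring matching) rather than the tokenized join used above, and the names keyword_file, descriptions, keyword_sets, event and result are my own; treat it as one possible approach, not a tuned solution.

```r
keyword_file <- data.frame(
  EventName = c("All-day tabby fest", "All-day tabby fest",
                "Maine Coon Grooming", "Maine Coon Grooming"),
  Keyword = c("tabby, all-day", "tabby, fest", "maine coon, groom", "coon, groom"),
  stringsAsFactors = FALSE
)
descriptions <- c(
  "Bring your tabby to the fest on Tuesday",
  "All cats are welcome to the fest on Tuesday",
  "Mainecoon grooming will happen at noon Wednesday",
  "Maine coons will be pampered at noon on Wednesday"
)

# Split each keyword string into a character vector of individual terms
keyword_sets <- strsplit(keyword_file$Keyword, ",\\s*")

event <- rep(NA_character_, length(descriptions))
for (j in seq_along(keyword_sets)) {
  # agrepl() is vectorised over the descriptions, so each term is one pass
  term_hits <- lapply(keyword_sets[[j]], agrepl, x = descriptions,
                      max.distance = 1, ignore.case = TRUE)
  # A keyword row matches only where ALL of its terms were found
  all_terms_hit <- Reduce(`&`, term_hits)
  # Keep the first matching event for each description
  fill <- all_terms_hit & is.na(event)
  event[fill] <- keyword_file$EventName[j]
}

result <- data.frame(Description = descriptions,
                     Match = !is.na(event),
                     EventName = event,
                     stringsAsFactors = FALSE)
print(result)
```

On the four sample rows this gives TRUE, FALSE, TRUE, FALSE, as in the table in the question. Looping over the roughly 2,000 keyword rows and letting agrepl() scan all descriptions at once avoids the nested row-by-row loop, though I have not timed it on 770,000 rows.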
If you don't understand parts of the answer, feel free to ask. Some of it is explained here, except for the fuzzy matching part.
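You can use the agrep() function in R for approximate matching; grepl() only does exact pattern matching. agrep() also works with the option fixed = FALSE, which treats the pattern as a regular expression. These functions do not need any additional libraries.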
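As a rough illustration on two of the sample descriptions, the sketch below compares exact and approximate matching for the keyword "maine coon". The max.distance of 1 is an assumed threshold, not a recommendation.

```r
descriptions <- c(
  "Mainecoon grooming will happen at noon Wednesday",
  "All cats are welcome to the fest on Tuesday"
)

# Exact pattern match: fails on "Mainecoon" because the space is missing
exact <- grepl("maine coon", descriptions, ignore.case = TRUE)

# Approximate match: allows up to one insertion, deletion or substitution
approx <- agrepl("maine coon", descriptions, max.distance = 1, ignore.case = TRUE)

# agrep() with value = TRUE returns the matching strings themselves
matched <- agrep("maine coon", descriptions, max.distance = 1,
                 ignore.case = TRUE, value = TRUE)

print(exact)
print(approx)
print(matched)
```

agrepl() returns a logical vector like grepl() does, so it can be dropped into the all(sapply(...)) check from the question.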