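How to replace a list of misspelled words with a list of correct words?
I'm trying to figure out how to replace a long list of misspelled words using a list of correct words, but I'm not sure how to do it. Please advise if possible. Thanks.
I tried str_replace and gsub, but they don't really work, apparently because I want to make the changes inside a data frame.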
df = tibble(Movie_Name = list("Black Panthet", "Irom Man", "Captain Anerica", "Black Panthers", "Iron Men", "Captain America", "Avangers"))
correct = tibble(correct_movie_name = list("Black Panther", "Iron Man", "Captain American", "Avengers"))
I'd like the output to look like this:
df = tibble(Movie_Name = list("Black Panther", "Iron Man", "Captain America", "Black Panther", "Iron Man", "Captain America", "Avengers"))
One approach might be to use the Levenshtein distance, which is available from the stringdist package.
library(stringdist)
MovieNames = unlist(df$Movie_Name)
CorrectNames = unlist(correct$correct_movie_name)
for(MN in MovieNames) {
CMN = which.min(stringdist(CorrectNames, MN, method = "lv"))
cat(MN, " should be ", CorrectNames[CMN], "\n")
}
Black Panthet should be Black Panther
Irom Man should be Iron Man
Captain Anerica should be Captain American
Black Panthers should be Black Panther
Iron Men should be Iron Man
Captain America should be Captain American
Avangers should be Avengers
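The loop above only prints its suggestions. To actually overwrite the column, the same nearest-match idea can be sketched in vectorized form. This is a minimal sketch, not the answerer's code: it swaps in base R's utils::adist (generalized Levenshtein distance) for stringdist so it has no package dependencies, and it assumes the correct list contains "Captain America" (as in the desired output) rather than "Captain American". The names movie_names, correct_names, closest and df_fixed are made up for illustration.

```r
# Sample data from the question, as plain character vectors.
movie_names <- c("Black Panthet", "Irom Man", "Captain Anerica",
                 "Black Panthers", "Iron Men", "Captain America", "Avangers")
# Assumption: "Captain America", matching the desired output in the question.
correct_names <- c("Black Panther", "Iron Man", "Captain America", "Avengers")

# adist() returns a matrix with one row per input name
# and one column per candidate correct name.
d <- utils::adist(movie_names, correct_names)

# For each row, pick the candidate with the smallest edit distance.
closest <- correct_names[apply(d, 1, which.min)]

# Build the corrected data frame.
df_fixed <- data.frame(Movie_Name = closest, stringsAsFactors = FALSE)
print(df_fixed)
```

Ties are resolved by which.min, which keeps the first candidate with the minimal distance, so the order of the correct list can matter.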
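I don't think there is a perfect solution for this. The best approach is to compute some kind of edit distance between Movie_Name and correct_movie_name, and replace each name with the entry of correct_movie_name that has the smallest distance. Which metric to use depends heavily on the situation and takes a lot of tuning. Here I used the stringdist function from the stringdist package, which offers a variety of distance metrics to choose from. The default is "restricted Damerau-Levenshtein distance" (from ?stringdist). We can also use levenshteinDist from the RecordLinkage package: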
library(dplyr)
library(purrr)       # map_chr() comes from purrr, not dplyr
library(stringdist)
library(RecordLinkage)
replace_names <- function(vec, replace_list, dist_func){
  # unlist() so that a list column is handled as a plain character vector
  candidates <- unlist(replace_list)
  map_chr(vec, ~{
    candidates[which.min(dist_func(.x, candidates))]
  })
}
df %>%
  mutate(Correct_stringdist = replace_names(Movie_Name, correct$correct_movie_name, stringdist),
         Correct_levenshsteinDist = replace_names(Movie_Name, correct$correct_movie_name, levenshteinDist))
Output:
# A tibble: 7 x 3
  Movie_Name      Correct_stringdist Correct_levenshsteinDist
  <chr>           <chr>              <chr>
1 Black Panthet   Black Panther      Black Panther
2 Irom Man        Iron Man           Iron Man
3 Captain Anerica Captain American   Captain American
4 Black Panthers  Black Panther      Black Panther
5 Iron Men        Iron Man           Iron Man
6 Captain America Captain American   Captain American
7 Avangers        Avengers           Avengers
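Since nearest-match replacement always picks something, a name that is not close to any candidate still gets rewritten. One possible piece of the tuning mentioned above is a distance cutoff. The following is a rough sketch under stated assumptions: replace_if_close and its max_dist parameter are hypothetical names, the cutoff of 2 is arbitrary, and base R's utils::adist is used in place of stringdist to keep the example dependency-free.

```r
# Hypothetical helper: replace each name by its nearest candidate,
# but only when the edit distance is at most max_dist.
# Names with no sufficiently close candidate are returned unchanged.
replace_if_close <- function(x, candidates, max_dist = 2) {
  d <- utils::adist(x, candidates)            # rows: x, columns: candidates
  best <- apply(d, 1, which.min)              # index of nearest candidate per name
  best_dist <- d[cbind(seq_along(x), best)]   # distance to that candidate
  ifelse(best_dist <= max_dist, candidates[best], x)
}

titles <- c("Irom Man", "Avangers", "Titanic")
fixed <- replace_if_close(titles, c("Iron Man", "Avengers"))
print(fixed)
```

Here "Titanic" is far from both candidates, so it is left as-is, while the two misspellings are corrected. A suitable cutoff depends on the data and would need to be checked against real examples.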
The agrep function lets you do approximate matching between strings.
df = tibble(Movie_Name = list("Black Panthet", "Irom Man", "Captain Anerican", "Black Panthers", "Iron Men", "Captain America", "Avangers"))
correct = tibble(correct_movie_name = list("Black Panther", "Iron Man", "Captain America", "Avengers"))
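df2 = tibble(Movie_Name = sapply(df$Movie_Name, function(x){
  for(i in correct$correct_movie_name){
    # agrep() returns the indices of matching elements; x is a single string,
    # so a hit is index 1 and a miss is an empty vector
    comparison <- agrep(i, x)
    if(length(comparison) != 0 && comparison == 1){
      return(i)
    }
  }
  # no approximate match found: keep the original name
  return(x)
}))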
Here is a solution based on the answers by @G5W and avid_useR:
library(tidyverse)
library(stringdist)
Movie_Name = list("Black Panthet", "Irom Man", "Captain Anerica", "Black Panthers", "Iron Men", "Captain America", "Avangers")
correct_movie_name = list("Black Panther", "Iron Man", "Captain America", "Avengers")
New_Movie_name <- lapply(Movie_Name, function(x) {
lapply(correct_movie_name, function(y) {
stringdist(x,y)
}) %>% unlist() %>% which.min() %>% correct_movie_name[[.]]
})
# New_Movie_name is a list of the same length as Movie_Name but with correct movie names based on elements in list correct_movie_name