Calculate pairwise similarity of strings in a column using tidyverse
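If I have a df with the following columns and rows:

row | text |
---|---|
1 | This sentence is very similar to the next sentence |
2 | This sentence is not very similar to the next sentence |
3 | You can't sneeze with your eyes opened |
... | ... |

How can I apply a function that checks whether each value in the text column has a similar sentence in another row of that column? What I want to do is remove the rows whose text value is too similar to another one. For example, how can I make sure that no cell in the column is more than 30%, 40%, or 80% similar to another string in the same column?
The result I would like to get looks like this:

row | text |
---|---|
1 | This sentence is very similar to the next sentence |
3 | You can't sneeze with your eyes opened |
... | ... |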
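This is not the most elegant solution, and it is slow on large data.frames, but you can use stringdist::stringsim. It compares texts and can return different similarity measures (see the method argument). So, with your data: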
df <- tibble::tribble(
  ~row, ~text,
  1, "This sentence is very similar to the next sentence",
  2, "This sentence is not very similar to the next sentence",
  3, "You can't sneeze with your eyes opened"
)
stringdist::stringsim(df$text[1], df$text)
#> [1] 1.0000000 0.9259259 0.2800000
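As a small sketch of what the method argument changes, the snippet below scores the first two sentences with several of the measures that stringsim() supports. The selection of methods and the choice of q = 2 (which only affects the q-gram based measures) are illustrative assumptions, not a recommendation from the answer.

```r
library(stringdist)

a <- "This sentence is very similar to the next sentence"
b <- "This sentence is not very similar to the next sentence"

# A few of the measures accepted by the `method` argument
methods <- c("osa", "lv", "jw", "cosine", "jaccard")

# One similarity score per method; q is ignored by the edit-based methods
sims <- vapply(
  methods,
  function(m) stringsim(a, b, method = m, q = 2),
  numeric(1)
)

round(sims, 3)
```

Each measure uses its own scale within 0 to 1, so a threshold such as 0.8 probably needs to be re-tuned whenever the method changes.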
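We can wrap this in a function that compares each text with all the texts that came before it and returns a logical vector.
library(dplyr)
find_dup <- function(string, thres) {
  purrr::map_lgl(seq_along(string), function(i) {
    # Compare element i with every earlier element; seq_len(0) is empty,
    # so the first element is never flagged
    sim <- stringdist::stringsim(string[i], string[seq_len(i - 1)])
    any(sim > thres)
  })
}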
With mutate() you can check that the result is correct, and then use filter() to remove the duplicated entries:
df %>%
  mutate(dup = find_dup(text, 0.8))
#> # A tibble: 3 × 3
#> row text dup
#> <dbl> <chr> <lgl>
#> 1 1 This sentence is very similar to the next sentence FALSE
#> 2 2 This sentence is not very similar to the next sentence TRUE
#> 3 3 You can't sneeze with your eyes opened FALSE
df %>%
  filter(!find_dup(text, 0.8))
#> # A tibble: 2 × 2
#> row text
#> <dbl> <chr>
#> 1 1 This sentence is very similar to the next sentence
#> 2 3 You can't sneeze with your eyes opened
Created on 2022-02-07 by the reprex package (v2.0.1)
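The question mentions several cut-offs (30%, 40%, 80%). A minimal sketch of trying them side by side is shown below. It repeats the data and find_dup() so that it runs on its own; the names thresholds and kept are made up for this illustration.

```r
library(dplyr)

df <- tibble::tribble(
  ~row, ~text,
  1, "This sentence is very similar to the next sentence",
  2, "This sentence is not very similar to the next sentence",
  3, "You can't sneeze with your eyes opened"
)

# Flags a text when it is too similar to any text that appears before it
find_dup <- function(string, thres) {
  purrr::map_lgl(seq_along(string), function(i) {
    sim <- stringdist::stringsim(string[i], string[seq_len(i - 1)])
    any(sim > thres)
  })
}

# Hypothetical set of cut-offs to compare
thresholds <- c(0.3, 0.4, 0.8)

# One filtered data frame per threshold
kept <- lapply(thresholds, function(th) dplyr::filter(df, !find_dup(text, th)))
names(kept) <- paste0("thres_", thresholds)

# Number of rows that survive at each threshold
vapply(kept, nrow, integer(1))
```

Looking at the row counts first, and then at kept[["thres_0.8"]] and the others, is one way to pick a cut-off before filtering the real data.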
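Here is a solution based on the Levenshtein distance between strings, computed with the levenshteinSim function from the RecordLinkage package. It is fairly fast:
library(RecordLinkage)
exclude_similar <- function(text, similarity = 0.8) {
  # Pairwise similarity matrix, split into a list with one vector per row
  sim_mat <- asplit(outer(text, text, levenshteinSim), 1)
  # For each text, collect the indices of later texts that are too similar
  exclude <- unlist(lapply(seq_along(sim_mat), function(x) {
    y <- which(sim_mat[[x]] > similarity)
    y[y > x]
  }))
  # TRUE marks the rows to keep
  answer <- rep(TRUE, length(text))
  answer[exclude] <- FALSE
  return(answer)
}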
You would use the function like this:
df[exclude_similar(df$text, similarity = 0.8), ]
#> row text
#> 1 1 This sentence is very similar to the next sentence
#> 3 3 You can't sneeze with your eyes opened
df[exclude_similar(df$text, similarity = 0.1), ]
#> row text
#> 1 1 This sentence is very similar to the next sentence
df[exclude_similar(df$text, similarity = 0.95), ]
#> row text
#> 1 1 This sentence is very similar to the next sentence
#> 2 2 This sentence is not very similar to the next sentence
#> 3 3 You can't sneeze with your eyes opened
Created on 2022-02-07 by the reprex package (v2.0.1)
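To see why different similarity values keep different rows, it can help to look at the matrix that exclude_similar() builds internally. The sketch below prints it for the three example sentences; the row and column labels are added only for readability and are an assumption of this example.

```r
library(RecordLinkage)

text <- c(
  "This sentence is very similar to the next sentence",
  "This sentence is not very similar to the next sentence",
  "You can't sneeze with your eyes opened"
)

# Pairwise Levenshtein similarity: entry [i, j] compares text i with text j
sim_mat <- outer(text, text, levenshteinSim)

# Hypothetical labels so the printed matrix is easier to read
labels <- paste0("row", seq_along(text))
dimnames(sim_mat) <- list(labels, labels)

round(sim_mat, 2)
```

A row is dropped when its similarity to an earlier row exceeds the threshold, which is why a low value such as 0.1 leaves only the first sentence.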
Data used
df <- read.table(text = "row text
1 \"This sentence is very similar to the next sentence\"
2 \"This sentence is not very similar to the next sentence\"
3 \"You can't sneeze with your eyes opened\"", header = TRUE)