Calculate pairwise similarity of strings in column using tidyverse

Question

If I have a df with the following columns and rows:

row	text
1	This sentence is very similar to the next sentence
2	This sentence is not very similar to the next sentence
3	You can't sneeze with your eyes opened
...	...

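How can I apply a function that checks whether each value in the text column has a similar sentence in another row of that column? What I want to do is remove rows whose text value is too similar to that of another row. For example, how can I make sure that no cell in the column is more than 30%, 40%, or 80% similar to another string in the same column?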

The result I would like to get looks like this:

row	text
1	This sentence is very similar to the next sentence
3	You can't sneeze with your eyes opened
...	...

Answer 1

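This is not the most elegant solution, and it is slow on large data.frames, but you can use stringdist::stringsim. It compares texts and can return several different similarity measures (see the method argument). So, with your data: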

df <- tibble::tribble(
  ~row, ~text,
  1,    "This sentence is very similar to the next sentence",
  2,    "This sentence is not very similar to the next sentence",
  3,    "You can't sneeze with your eyes opened"
)


stringdist::stringsim(df$text[1], df$text)
#> [1] 1.0000000 0.9259259 0.2800000
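To illustrate the method argument mentioned above, here is a minimal sketch that scores the same sentences under a few of the metrics stringsim accepts. The choice of metrics and the value q = 2 are assumptions made for illustration, not part of the original answer; q only affects the q-gram based methods such as "cosine".

```r
library(stringdist)

a <- "This sentence is very similar to the next sentence"
b <- c(
  "This sentence is not very similar to the next sentence",
  "You can't sneeze with your eyes opened"
)

# "osa" is the default metric, "lv" is plain Levenshtein, "jw" is Jaro
# (Jaro-Winkler when p > 0), and "cosine" compares q-gram profiles.
methods <- c("osa", "lv", "jw", "cosine")

# One column per metric, one row per string in b; all scores lie in [0, 1].
sims <- sapply(methods, function(m) stringsim(a, b, method = m, q = 2))
round(sims, 3)
```

Different metrics put the scores on different scales, so a threshold such as 0.8 probably needs to be re-tuned whenever the method changes.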

We can wrap this in a function that compares each text with all the texts that appear before it and returns a logical vector.

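library(dplyr)

# Flag each string that is too similar to any string earlier in the vector
find_dup <- function(string, thres) {
  purrr::map_lgl(seq_along(string), function(i) {
    # seq_len(i - 1) is empty for the first element, so it is never flagged
    sim <- stringdist::stringsim(string[i], string[seq_len(i - 1)])
    any(sim > thres)
  })
}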

With mutate() you can check that the results are correct, and then use filter() to remove the duplicate entries:

df %>% 
  mutate(dup = find_dup(text, 0.8))
#> # A tibble: 3 × 3
#>     row text                                                   dup  
#>   <dbl> <chr>                                                  <lgl>
#> 1     1 This sentence is very similar to the next sentence     FALSE
#> 2     2 This sentence is not very similar to the next sentence TRUE 
#> 3     3 You can't sneeze with your eyes opened                 FALSE

df %>% 
  filter(!find_dup(text, 0.8))
#> # A tibble: 2 × 2
#>     row text                                              
#>   <dbl> <chr>                                             
#> 1     1 This sentence is very similar to the next sentence
#> 2     3 You can't sneeze with your eyes opened

Created on 2022-02-07 by the reprex package (v2.0.1)
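Because find_dup calls stringsim once per row, one possible way to reduce the overhead on larger data is to compute all pairwise similarities in a single call. The sketch below assumes a stringdist version that provides stringsimmatrix(); find_dup_matrix is a hypothetical name and not part of the original answer.

```r
library(stringdist)

# Hypothetical vectorised variant of find_dup(): one call builds the full
# similarity matrix instead of one stringsim() call per row.
find_dup_matrix <- function(string, thres) {
  sim <- stringsimmatrix(string, string)
  # Keep only comparisons with earlier strings (row i vs. columns j < i);
  # -Inf can never exceed the threshold.
  sim[upper.tri(sim, diag = TRUE)] <- -Inf
  unname(rowSums(sim > thres) > 0)
}

texts <- c(
  "This sentence is very similar to the next sentence",
  "This sentence is not very similar to the next sentence",
  "You can't sneeze with your eyes opened"
)

find_dup_matrix(texts, 0.8)
#> [1] FALSE  TRUE FALSE
```

The full matrix needs memory that grows with the square of the number of rows, so for very large columns the row-by-row version may still be the safer choice.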

Answer 2

Here is a solution based on calculating the Levenshtein distance between strings with the levenshteinSim function from the RecordLinkage package. It is quite fast:

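library(RecordLinkage)

exclude_similar <- function(text, similarity = 0.8) {
  # Pairwise Levenshtein similarity matrix, split into one vector per row
  sim_mat <- asplit(outer(text, text, levenshteinSim), 1)
  # For each string x, collect the later strings that are too similar to it
  exclude <- unlist(lapply(seq_along(sim_mat), function(x) {
    y <- which(sim_mat[[x]] > similarity)
    y[y > x]
  }))
  # TRUE marks the rows to keep
  answer <- rep(TRUE, length(text))
  answer[exclude] <- FALSE
  return(answer)
}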

You would use the function like this:

df[exclude_similar(df$text, similarity = 0.8), ]
#>   row                                               text
#> 1   1 This sentence is very similar to the next sentence
#> 3   3             You can't sneeze with your eyes opened

df[exclude_similar(df$text, similarity = 0.1), ]
#>   row                                               text
#> 1   1 This sentence is very similar to the next sentence

df[exclude_similar(df$text, similarity = 0.95), ]
#>   row                                                   text
#> 1   1     This sentence is very similar to the next sentence
#> 2   2 This sentence is not very similar to the next sentence
#> 3   3                 You can't sneeze with your eyes opened

Created on 2022-02-07 by the reprex package (v2.0.1)
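To make explicit what the similarity score in this answer measures: the RecordLinkage documentation describes levenshteinSim as 1 - d / max(A, B), where d is the Levenshtein distance and A and B are the lengths of the two strings. A minimal base R sketch of the same formula, using utils::adist() for the distances, might look like the following. lev_sim_matrix is a hypothetical helper name, and the sketch assumes no empty strings, since two empty strings would give 0 / 0.

```r
# Hypothetical helper: pairwise Levenshtein similarity using base R only.
lev_sim_matrix <- function(text) {
  d <- utils::adist(text)          # pairwise Levenshtein distances
  len <- nchar(text)
  1 - d / outer(len, len, pmax)    # normalise by the longer string's length
}

texts <- c(
  "This sentence is very similar to the next sentence",
  "This sentence is not very similar to the next sentence",
  "You can't sneeze with your eyes opened"
)

sim <- lev_sim_matrix(texts)

# Sentences 1 and 2 differ by the four inserted characters "not ",
# and the longer sentence has 54 characters: 1 - 4 / 54.
sim[1, 2]
#> [1] 0.9259259
```

This is why a threshold of 0.8 drops row 2 while a threshold of 0.95 keeps it.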

Data used

df <- read.table(text = "row    text
1   \"This sentence is very similar to the next sentence\"
2   \"This sentence is not very similar to the next sentence\"
3   \"You can't sneeze with your eyes opened\"", header = TRUE)
