Calculate pairwise similarity of strings in column using tidyverse

If I have a df with the following columns and rows:

row text
1 This sentence is very similar to the next sentence
2 This sentence is not very similar to the next sentence
3 You can't sneeze with your eyes opened
... ...

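How can I apply a function that checks whether each value in the text column has a similar sentence in another row of that column? What I want to do is remove rows whose text value is too similar to that of another row. For example, how can I make sure that no cell in the column is more than 30%, 40%, or 80% similar to another string in the same column?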

The result I would like to get is:

row text
1 This sentence is very similar to the next sentence
3 You can't sneeze with your eyes opened
... ...

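This is not the most elegant solution, and it is slow on large data.frames, but you can use stringdist::stringsim. It compares texts and can return different similarity measures (see the method argument). So, with your data: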

df <- tibble::tribble(
  ~row, ~text,
  1,    "This sentence is very similar to the next sentence",
  2,    "This sentence is not very similar to the next sentence",
  3,    "You can't sneeze with your eyes opened"
)


stringdist::stringsim(df$text[1], df$text)
#> [1] 1.0000000 0.9259259 0.2800000
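As a small sketch of what the method argument changes, the same pair of sentences can be scored with a few of the metrics that stringdist documents. The variable names here are only for illustration, and the choice of "jw" and "cosine" with q = 2 is arbitrary rather than a recommendation:

```r
library(stringdist)

a <- "This sentence is very similar to the next sentence"
b <- "This sentence is not very similar to the next sentence"

# Default method: optimal string alignment ("osa"), an edit distance
sim_osa <- stringsim(a, b, method = "osa")

# Jaro-Winkler similarity
sim_jw <- stringsim(a, b, method = "jw")

# Cosine similarity between character bigram profiles (q = 2)
sim_cos <- stringsim(a, b, method = "cosine", q = 2)

c(osa = sim_osa, jw = sim_jw, cosine = sim_cos)
```

Each metric puts the scores on a different scale in practice, so a threshold such as 0.8 probably needs to be re-tuned whenever the method changes.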

We can wrap this in a function that compares each text with all the texts that came before it and returns a logical vector.

library(dplyr)
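# Flag each string that is too similar to any earlier string
find_dup <- function(string, thres) {
  purrr::map_lgl(seq_along(string), function(i) {
    # seq_len(i - 1) is empty for i = 1, so the first string is never flagged
    sim <- stringdist::stringsim(string[i], string[seq_len(i - 1)])
    any(sim > thres)
  })
}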

With mutate you can check that the results are correct, and then use filter() to remove the duplicated entries:
df %>% 
  mutate(dup = find_dup(text, 0.8))
#> # A tibble: 3 × 3
#>     row text                                                   dup  
#>   <dbl> <chr>                                                  <lgl>
#> 1     1 This sentence is very similar to the next sentence     FALSE
#> 2     2 This sentence is not very similar to the next sentence TRUE 
#> 3     3 You can't sneeze with your eyes opened                 FALSE

df %>% 
  filter(!find_dup(text, 0.8))
#> # A tibble: 2 × 2
#>     row text                                              
#>   <dbl> <chr>                                             
#> 1     1 This sentence is very similar to the next sentence
#> 2     3 You can't sneeze with your eyes opened
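If the row-by-row loop turns out to be the bottleneck, one possible variation is to compute all pairwise similarities in a single call and then look only at pairs where the other text comes earlier. The sketch below assumes a version of stringdist that provides stringsimmatrix; find_dup_matrix is a hypothetical name, not part of any package. Note that it builds a full n x n matrix, so memory grows quadratically with the number of rows.

```r
library(stringdist)

# Hypothetical matrix-based variant of find_dup():
# TRUE where a string is too similar to any string that appears before it.
find_dup_matrix <- function(string, thres, method = "osa") {
  # sim[i, j] is the similarity between string[i] and string[j]
  sim <- stringdist::stringsimmatrix(string, string, method = method)
  # Keep only pairs with i < j (earlier row vs. later row)
  sim[lower.tri(sim, diag = TRUE)] <- -Inf
  # Column j is flagged if any earlier row i exceeds the threshold
  unname(colSums(sim > thres) > 0)
}

texts <- c(
  "This sentence is very similar to the next sentence",
  "This sentence is not very similar to the next sentence",
  "You can't sneeze with your eyes opened"
)

find_dup_matrix(texts, 0.8)
```

It can be dropped into the same pipeline, for example df %>% filter(!find_dup_matrix(text, 0.8)).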

Created on 2022-02-07 by the reprex package (v2.0.1)

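Here is a solution that uses the levenshteinSim function from the RecordLinkage package to compute a similarity based on the Levenshtein distance between strings. It is reasonably fast: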

library(RecordLinkage)

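exclude_similar <- function(text, similarity = 0.8) {
  # Pairwise similarity matrix, split into one vector per row
  sim_mat <- asplit(outer(text, text, levenshteinSim), 1)
  # For each row x, collect the later rows that are too similar to it
  exclude <- unlist(lapply(seq_along(sim_mat), function(x) {
    y <- which(sim_mat[[x]] > similarity)
    y[y > x]
  }))
  # TRUE = keep the row, FALSE = drop it
  answer <- rep(TRUE, length(text))
  answer[exclude] <- FALSE
  answer
}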

You would use the function like this:

df[exclude_similar(df$text, similarity = 0.8), ]
#>   row                                               text
#> 1   1 This sentence is very similar to the next sentence
#> 3   3             You can't sneeze with your eyes opened

df[exclude_similar(df$text, similarity = 0.1), ]
#>   row                                               text
#> 1   1 This sentence is very similar to the next sentence

df[exclude_similar(df$text, similarity = 0.95), ]
#>   row                                                   text
#> 1   1     This sentence is very similar to the next sentence
#> 2   2 This sentence is not very similar to the next sentence
#> 3   3                 You can't sneeze with your eyes opened
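To make the threshold easier to interpret, here is a minimal base-R sketch of the quantity being compared: one minus the Levenshtein distance divided by the length of the longer string, which is how the RecordLinkage documentation describes levenshteinSim. It uses utils::adist for the distance, and lev_sim is a hypothetical helper name used only for this illustration:

```r
# Hypothetical helper: Levenshtein similarity from base R's adist()
lev_sim <- function(a, b) {
  d <- drop(utils::adist(a, b))       # Levenshtein (edit) distance
  1 - d / max(nchar(a), nchar(b))     # scale by the longer string
}

s1 <- "This sentence is very similar to the next sentence"
s2 <- "This sentence is not very similar to the next sentence"

# s2 is s1 with "not " inserted: 4 edits over 54 characters
lev_sim(s1, s2)
#> [1] 0.9259259
```

So a threshold of 0.8 drops the second row, while 0.95 keeps it, which matches the outputs above.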

Created on 2022-02-07 by the reprex package (v2.0.1)

Data used

df <- read.table(text = "row    text
1   \"This sentence is very similar to the next sentence\"
2   \"This sentence is not very similar to the next sentence\"
3   \"You can't sneeze with your eyes opened\"", header = TRUE)