Finding the cosine similarity of a sentence with many others in R

Question

I want to compute the cosine similarity of one sentence with many other sentences using R. For example:

s1 <- "The book is on the table"  
s2 <- "The pen is on the table"  
s3 <- "Put the pen on the book"  
s4 <- "Take the book and pen"  

sn <- "Take the book and pen from the table"

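I want to find the cosine similarity of s1, s2, s3 and s4 with sn. I know I have to use vectors (converting the sentences to vectors and using TF-IDF and/or a dot product), but since I am fairly new to R, I am having trouble implementing it.

All help is appreciated.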

Answer 1

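The best way to do what the question asks for is to use the package stringdist. Note that it returns a cosine distance, so smaller values mean more similar strings.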

library(stringdist)

stringdist(sn, c(s1, s2, s3, s4), method = "cosine")
#[1] 0.06479426 0.08015590 0.09776951 0.04376841

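When the string names follow an obvious pattern, as those in the question do, mget can be used, so the names do not have to be hard-coded one by one in the call to stringdist.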

# Collect every variable named s1, s2, ... (the backslash must be doubled in an R string)
s_vec <- unlist(mget(ls(pattern = "^s\\d+")))
stringdist(sn, s_vec, method = "cosine")
#[1] 0.06479426 0.08015590 0.09776951 0.04376841

Answer 2

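The cosine dissimilarity used by stringdist is not based on words or terms but on qgrams, which are sequences of q characters that may or may not form words. We can see intuitively that there is something wrong with the output given in Rui's answer. The only difference between the first two sentences is pen versus book, and the last sentence contains each of those words once, so we would expect the s1 vs. sn and s2 vs. sn dissimilarities to be identical, but they are not.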
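
To make the point about qgrams concrete, here is a minimal base R sketch of a character-level cosine distance. It assumes q = 1 (single characters, spaces included, case-sensitive), which I believe is the stringdist default; the helper name char_cosine_dist is made up for this illustration and is not part of any package.

```r
s1 <- "The book is on the table"
s2 <- "The pen is on the table"
sn <- "Take the book and pen from the table"

# Hypothetical helper: cosine distance between single-character count profiles
char_cosine_dist <- function(a, b) {
  ca <- strsplit(a, "")[[1]]           # characters of a, spaces included
  cb <- strsplit(b, "")[[1]]           # characters of b
  chars <- union(ca, cb)               # shared alphabet for both profiles
  # Count each character over the shared alphabet (absent characters count 0)
  va <- as.numeric(table(factor(ca, levels = chars)))
  vb <- as.numeric(table(factor(cb, levels = chars)))
  1 - sum(va * vb) / (sqrt(sum(va^2)) * sqrt(sum(vb^2)))
}

d1 <- char_cosine_dist(sn, s1)
d2 <- char_cosine_dist(sn, s2)
c(s1 = d1, s2 = d2)
# roughly 0.0648 and 0.0802
```

These agree with the first two values in Rui's output. The two distances differ only because "book" and "pen" are made of different letters, not because the sentences differ in meaning.
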
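There may be other R libraries that compute a more conventional cosine similarity, but it is not hard to do it ourselves from first principles, and it may end up being more educational.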

sv <- c(s1=s1, s2=s2, s3=s3, s4=s4, sn=sn)

# Split sentences into words
svs <- strsplit(tolower(sv), "\\s+")

# Calculate term frequency tables (tf)
termf <- table(stack(svs))

# Calculate inverse document frequencies (idf)
idf <- log(1/rowMeans(termf != 0))

# Multiply to get tf-idf
tfidf <- termf*idf

# Calculate dot products between the last tf-idf and all the previous
dp <- t(tfidf[,5]) %*% tfidf[,-5]

# Divide by the product of the Euclidean norms to get the cosine similarity
cosim <- dp/(sqrt(colSums(tfidf[,-5]^2))*sqrt(sum(tfidf[,5]^2)))
cosim
#           [,1]      [,2]       [,3]      [,4]
# [1,] 0.1215616 0.1215616 0.02694245 0.6198245
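
For comparison, the same idea can be sketched with raw term frequencies only, skipping the idf weighting. This is a variant for illustration, not the method above; the names vocab, tf, target, others and tf_cosim are made up here.

```r
sv <- c(s1 = "The book is on the table",
        s2 = "The pen is on the table",
        s3 = "Put the pen on the book",
        s4 = "Take the book and pen",
        sn = "Take the book and pen from the table")

# Term-frequency matrix: one row per word, one column per sentence
words <- strsplit(tolower(sv), "\\s+")
vocab <- unique(unlist(words))
tf <- sapply(words, function(w) as.numeric(table(factor(w, levels = vocab))))
rownames(tf) <- vocab

# Cosine similarity of every other sentence with sn
target <- tf[, "sn"]
others <- tf[, colnames(tf) != "sn", drop = FALSE]
tf_cosim <- as.vector(crossprod(others, target)) /
  (sqrt(colSums(others^2)) * sqrt(sum(target^2)))
tf_cosim
```

Here s1, s2 and s3 each come out at about 0.67 and s4 at about 0.85. Without idf, the frequent word "the" dominates every vector, which is why the scores are much higher and less spread out than the tf-idf results.
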
