How to extract and use BERT encodings of sentences for text similarity among sentences (PyTorch/TensorFlow)

Question

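I want to build a text similarity model that I intend to use for FAQ lookup and for other ways of retrieving the most relevant text. I want to use a highly optimized BERT model for this NLP task. My plan is to take the encodings of all the sentences, compute a similarity matrix with cosine_similarity, and return the results.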

As a hypothetical, if I have two sentences, hello world and hello hello world, I assume BERT will give me something like [0.2, 0.3, 0] (0 for padding) and [0.2, 0.2, 0.3], and I can pass these two to sklearn's cosine_similarity.
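
To make the hypothetical concrete, here is a minimal sketch of cosine similarity on the two toy vectors from the question. The helper name `cosine` is made up for illustration; sklearn's cosine_similarity computes the same quantity for the rows of 2D arrays. The numbers are only the illustrative values from the question: BERT actually returns one vector per token (768 dimensions for bert-base-uncased), not one number per token.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors from the question (illustrative values, not real BERT output).
hello_world = [0.2, 0.3, 0.0]        # trailing 0 stands for padding
hello_hello_world = [0.2, 0.2, 0.3]

similarity = cosine(hello_world, hello_hello_world)
print(round(similarity, 4))
```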

How should I extract the sentence embeddings so I can use them in the model? I found somewhere that they can be extracted like this:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

Using TensorFlow:

import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :]  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

Is this the right way to do it? I ask because I read somewhere that BERT provides different types of embeddings.

Also, please suggest any other methods for finding text similarity.

Answer 1

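When you want to compare sentence embeddings, the recommended way to do this with BERT is to use the value of the CLS token. This corresponds to the first token of the output (after the batch dimension).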

last_hidden_states = outputs[0]
cls_embedding = last_hidden_states[0][0]

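This gives you a single embedding for the whole sentence. Since the embedding has the same size for every sentence, you can easily compute the cosine similarity.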
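
As a rough sketch of that last step, the snippet below builds a pairwise cosine similarity matrix from a stack of sentence embeddings. It uses NumPy and small made-up vectors as stand-ins for real CLS embeddings, and `similarity_matrix` is a hypothetical helper name, not part of any library.

```python
import numpy as np

def similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of an (n, hidden) array."""
    embeddings = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms  # each row scaled to length 1
    return unit @ unit.T       # dot products of unit vectors are cosines

# Stand-in for three CLS embeddings (real ones would have 768 dimensions
# for bert-base-uncased); the values here are made up for illustration.
cls_embeddings = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0, 0.0],   # same direction as row 0
    [0.0, 3.0, 0.0, 0.0],   # orthogonal to rows 0 and 1
])

sims = similarity_matrix(cls_embeddings)
print(sims)
```

For an FAQ lookup, row i of the matrix can be sorted to rank the other sentences by similarity to sentence i. A CPU PyTorch tensor of embeddings can be converted with `.detach().numpy()` before being passed in.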

If you do not get satisfactory results with the CLS token, you can also try averaging the output embeddings of every word in the sentence.
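
One possible way to do that averaging is sketched below with NumPy on made-up stand-in data. The attention mask is an assumption added here, not something the answer above mentions: it only matters when several sentences are padded to a common length, where averaging the padding slots would distort the result. `mean_pool` is a hypothetical helper name. Note that this simple version also averages in the [CLS] and [SEP] positions if the mask marks them as real tokens.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    token_embeddings: (batch, seq_len, hidden) array
    attention_mask:   (batch, seq_len) array, 1 = real token, 0 = padding
    """
    token_embeddings = np.asarray(token_embeddings, dtype=float)
    mask = np.asarray(attention_mask, dtype=float)[:, :, None]  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)   # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # (batch, 1), avoids division by zero
    return summed / counts

# Made-up stand-in for last_hidden_states: 2 sentences, 3 token slots, hidden size 2.
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],   # third slot is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
])
mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

sentence_embeddings = mean_pool(hidden, mask)
print(sentence_embeddings)
```

The resulting array has one row per sentence, so it can be compared with cosine similarity in the same way as the CLS embeddings.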

