Adding Special Tokens Changes all Embeddings - TF Bert Hugging Face

Given the following:

from transformers import TFAutoModel
from transformers import BertTokenizer


bert = TFAutoModel.from_pretrained('bert-base-cased')
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

I expected that if I add the special tokens, the embeddings of the remaining tokens would stay the same, but that is not the case. For example, I expected the following two values to be equal, yet every token's embedding changes. Why is that?

tokens = tokenizer(['this product is no good'], add_special_tokens=True, return_tensors='tf')
output = bert(tokens)

# hidden state of the first word, 'this' (position 1, because [CLS] is prepended)
output[0][0][1]

tokens = tokenizer(['this product is no good'], add_special_tokens=False, return_tensors='tf')
output = bert(tokens)

# hidden state of the first word, 'this' (position 0, no [CLS])
output[0][0][0]

When you set add_special_tokens=True, you include the [CLS] token at the front of the sentence and the [SEP] token at the end, which results in a total of 7 tokens instead of 5:

import tensorflow as tf

tokens = tokenizer(['this product is no good'], add_special_tokens=True, return_tensors='tf')
print(tokenizer.convert_ids_to_tokens(tf.squeeze(tokens['input_ids'], axis=0)))
['[CLS]', 'this', 'product', 'is', 'no', 'good', '[SEP]']
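For contrast, the same sentence tokenized with add_special_tokens=False gives only the 5 word-piece tokens (a quick sketch reusing the tokenizer from the question):

tokens = tokenizer(['this product is no good'], add_special_tokens=False, return_tensors='tf')
print(tokenizer.convert_ids_to_tokens(tf.squeeze(tokens['input_ids'], axis=0)))
['this', 'product', 'is', 'no', 'good']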

Your sentence-level embeddings are different because these two special tokens become part of your embeddings as they are propagated through the BERT model. They are not masked out the way padding tokens [PAD] are. Check the docs for more information.
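You can see that the special tokens are not masked by inspecting the attention_mask. A minimal sketch (reusing the tokenizer above and padding to an arbitrary length of 10): the mask is 1 for [CLS], the five words and [SEP], and 0 only for the [PAD] positions, so the special tokens take part in attention while padding does not.

batch = tokenizer(['this product is no good'], add_special_tokens=True,
                  padding='max_length', max_length=10, return_tensors='tf')
print(batch['attention_mask'])
tf.Tensor([[1 1 1 1 1 1 1 0 0 0]], shape=(1, 10), dtype=int32)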

If you take a closer look at BERT's Transformer-Encoder architecture and how the attention mechanism works, you will quickly understand why a single difference between two sentences produces different hidden_states. The new tokens are not simply concatenated onto the existing ones; in a sense, all the tokens are interdependent. As BERT author Jacob Devlin put it:

I'm not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations.

Or, from another interesting discussion:

[...] The value of CLS is influenced by other tokens, just like other tokens are influenced by their context (attention).
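To make this concrete, here is a minimal sketch (reusing bert and tokenizer from the question) that compares the hidden state of the word "this" in both settings. Because attention also draws on [CLS] and [SEP] when they are present, the two vectors come out similar but not identical:

tokens_special = tokenizer(['this product is no good'], add_special_tokens=True, return_tensors='tf')
tokens_plain = tokenizer(['this product is no good'], add_special_tokens=False, return_tensors='tf')

this_with_special = bert(tokens_special)[0][0][1]   # position 1: 'this', right after [CLS]
this_without_special = bert(tokens_plain)[0][0][0]  # position 0: 'this', no [CLS] in front

# tf.keras.losses.cosine_similarity returns the *negative* cosine similarity,
# so negate it; the value should be close to, but not exactly, 1.0
print(-tf.keras.losses.cosine_similarity(this_with_special, this_without_special).numpy())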