Adding Special Tokens Changes all Embeddings - TF Bert Hugging Face

Given the following:

from transformers import TFAutoModel
from transformers import BertTokenizer


bert = TFAutoModel.from_pretrained('bert-base-cased')
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

I expected that if I add the special tokens, the embeddings of the remaining tokens would stay the same, but that is not the case. For example, I expected the following two values to be equal, yet every token's embedding changes. Why is that?

tokens = tokenizer(['this product is no good'], add_special_tokens=True, return_tensors='tf')
output = bert(tokens)

# hidden state of the first word, 'this' (position 1, because [CLS] is prepended)
output[0][0][1]

tokens = tokenizer(['this product is no good'], add_special_tokens=False, return_tensors='tf')
output = bert(tokens)

# hidden state of the first word, 'this' (position 0, no [CLS])
output[0][0][0]

When you set add_special_tokens=True, you include the [CLS] token at the front of the sentence and the [SEP] token at the end, which results in a total of 7 tokens instead of 5:

import tensorflow as tf

tokens = tokenizer(['this product is no good'], add_special_tokens=True, return_tensors='tf')
print(tokenizer.convert_ids_to_tokens(tf.squeeze(tokens['input_ids'], axis=0)))
['[CLS]', 'this', 'product', 'is', 'no', 'good', '[SEP]']
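For contrast, the same sentence tokenized with add_special_tokens=False gives only the 5 word-piece tokens (a quick sketch reusing the tokenizer from the question):

tokens = tokenizer(['this product is no good'], add_special_tokens=False, return_tensors='tf')
print(tokenizer.convert_ids_to_tokens(tf.squeeze(tokens['input_ids'], axis=0)))
['this', 'product', 'is', 'no', 'good']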

Your sentence-level embeddings are different because these two special tokens become part of your embeddings as they are propagated through the BERT model. They are not masked out the way padding tokens [PAD] are. Check the docs for more information.
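You can see that the special tokens are not masked by inspecting the attention_mask. A minimal sketch (reusing the tokenizer above and padding to an arbitrary length of 10): the mask is 1 for [CLS], the five words and [SEP], and 0 only for the [PAD] positions, so the special tokens take part in attention while padding does not.

batch = tokenizer(['this product is no good'], add_special_tokens=True,
                  padding='max_length', max_length=10, return_tensors='tf')
print(batch['attention_mask'])
tf.Tensor([[1 1 1 1 1 1 1 0 0 0]], shape=(1, 10), dtype=int32)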

If you take a closer look at BERT's Transformer-Encoder architecture and how the attention mechanism works, you will quickly understand why a single difference between two sentences produces different hidden_states. The new tokens are not simply concatenated onto the existing ones; in a sense, all the tokens are interdependent. As BERT author Jacob Devlin put it:

I'm not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations.

Or, from another interesting discussion:

[...] The value of CLS is influenced by other tokens, just like other tokens are influenced by their context (attention).
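To make this concrete, here is a minimal sketch (reusing bert and tokenizer from the question) that compares the hidden state of the word "this" in both settings. Because attention also draws on [CLS] and [SEP] when they are present, the two vectors come out similar but not identical:

tokens_special = tokenizer(['this product is no good'], add_special_tokens=True, return_tensors='tf')
tokens_plain = tokenizer(['this product is no good'], add_special_tokens=False, return_tensors='tf')

this_with_special = bert(tokens_special)[0][0][1]   # position 1: 'this', right after [CLS]
this_without_special = bert(tokens_plain)[0][0][0]  # position 0: 'this', no [CLS] in front

# tf.keras.losses.cosine_similarity returns the *negative* cosine similarity,
# so negate it; the value should be close to, but not exactly, 1.0
print(-tf.keras.losses.cosine_similarity(this_with_special, this_without_special).numpy())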