Adding Special Tokens Changes all Embeddings - TF Bert Hugging Face
Given the following:
from transformers import TFAutoModel
from transformers import BertTokenizer
bert = TFAutoModel.from_pretrained('bert-base-cased')
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
I would have thought that if special tokens are added, the embeddings of the remaining tokens would stay the same, but they do not. For example, I expected the following two outputs to be equal, but every token's embedding changes. Why is that?
# With special tokens, [CLS] occupies index 0, so 'this' is the token at index 1
tokens = tokenizer(['this product is no good'], add_special_tokens=True, return_tensors='tf')
output = bert(tokens)
output[0][0][1]
# Without special tokens, 'this' is the token at index 0
tokens = tokenizer(['this product is no good'], add_special_tokens=False, return_tensors='tf')
output = bert(tokens)
output[0][0][0]
When you set add_special_tokens=True, you are including the [CLS] token at the beginning of the sentence and the [SEP] token at the end, which results in a total of 7 tokens instead of 5:
import tensorflow as tf

tokens = tokenizer(['this product is no good'], add_special_tokens=True, return_tensors='tf')
print(tokenizer.convert_ids_to_tokens(tf.squeeze(tokens['input_ids'], axis=0)))
['[CLS]', 'this', 'product', 'is', 'no', 'good', '[SEP]']
Your sentence-level embeddings are different because these two special tokens become part of your embeddings as they are propagated through the BERT model. They are not masked out the way padding tokens ([PAD]) are. Check out the docs for more information.
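A quick way to see that difference is to inspect the attention_mask the tokenizer produces: padding tokens are masked out (0) and therefore cannot influence any other position, while [CLS] and [SEP] are attended to (1) just like ordinary word pieces. A minimal sketch along those lines (the padding length of 10 is an arbitrary choice for illustration):

import tensorflow as tf
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
# Pad the sentence to 10 positions so that [PAD] tokens show up at the end
tokens = tokenizer(['this product is no good'], add_special_tokens=True,
                   padding='max_length', max_length=10, return_tensors='tf')
print(tokenizer.convert_ids_to_tokens(tf.squeeze(tokens['input_ids'], axis=0)))
# ['[CLS]', 'this', 'product', 'is', 'no', 'good', '[SEP]', '[PAD]', '[PAD]', '[PAD]']
print(tokens['attention_mask'].numpy())
# [[1 1 1 1 1 1 1 0 0 0]]  -> 1 for [CLS]/[SEP] and word pieces, 0 for [PAD]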
If you take a closer look at how BERT's Transformer-Encoder architecture and its attention mechanism work, you will quickly understand why a single difference between two input sequences yields different hidden_states. New tokens are not simply concatenated onto the existing ones; in a sense, all of the tokens depend on one another (see the sketch after the quotes below). According to BERT author Jacob Devlin:
I'm not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations.
Or another interesting discussion:
[...] The value of CLS is influenced by other tokens, just like other tokens are influenced by their context (attention).
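To make that interdependence concrete, here is a minimal sketch (reusing the bert and tokenizer objects from the question) that compares the hidden state of the same word piece 'this' with and without special tokens; the maximum absolute difference comes out clearly non-zero:

import tensorflow as tf
from transformers import TFAutoModel, BertTokenizer

bert = TFAutoModel.from_pretrained('bert-base-cased')
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

with_special = tokenizer(['this product is no good'], add_special_tokens=True, return_tensors='tf')
without_special = tokenizer(['this product is no good'], add_special_tokens=False, return_tensors='tf')

# output[0] is last_hidden_state with shape (batch, seq_len, hidden_size)
this_with = bert(with_special)[0][0][1]        # 'this' at index 1, after [CLS]
this_without = bert(without_special)[0][0][0]  # 'this' at index 0

# The vectors differ because [CLS]/[SEP] attend to, and are attended by, every other token
print(tf.reduce_max(tf.abs(this_with - this_without)).numpy())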