Tensorflow text tokenizer incorrect tokenization

I am trying to use the TF Tokenizer for an NLP model:

from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=200, split=" ")
sample_text = ["This is a sample sentence1 created by sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ", 
               "This is another sample sentence1 created by another sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]

tokenizer.fit_on_texts(sample_text)

print (tokenizer.texts_to_sequences(["sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]))

Output:

[[1, 7, 8, 9]]

Word_Index:

print(tokenizer.index_word[8])  ===> 'ab'
print(tokenizer.index_word[9])  ===> 'cdefghijklmnopqrstuvwxyz'

The problem is that the tokenizer creates tokens based on . in this case. I pass split = " " to the Tokenizer, so I expect the following output:

[[1,7,8]], where tokenizer.index_word[8] should be 'ab.cdefghijklmnopqrstuvwxyz'

as I want the tokenizer to create words based on the space (" ") only and not on any special characters.

How do I get the tokenizer to create tokens only on spaces?

Tokenizer accepts another argument called filters, which defaults to all ASCII punctuation (filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~\t\n'). During tokenization, every character contained in filters is replaced by the specified split string.
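
You can see that replacement directly with the text_to_word_sequence helper exported from the same module (a minimal sketch; the lowercase output assumes the default lower=True):

from tensorflow.keras.preprocessing.text import text_to_word_sequence

# With the default filters, '.' is replaced by the split string before splitting,
# so the last word is broken into two tokens.
print(text_to_word_sequence("sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"))
# ['sample', 'person', 'ab', 'cdefghijklmnopqrstuvwxyz']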

If you look at the source code of Tokenizer, specifically the method fit_on_texts, you will see it uses the function text_to_word_sequence, which receives the filters characters and treats them the same way as the split it also receives:

def text_to_word_sequence(... ):
    ...
    translate_dict = {c: split for c in filters}
    translate_map = maketrans(translate_dict)
    text = text.translate(translate_map)

    seq = text.split(split)
    return [i for i in seq if i]
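
The translate step above can be reproduced in plain Python to see exactly what happens to the sample string (a standalone sketch, independent of Keras):

filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'   # same characters as the Tokenizer default
split = " "
# every character in filters is mapped to the split string before splitting
translate_map = str.maketrans({c: split for c in filters})
text = "ab.cdefghijklmnopqrstuvwxyz".translate(translate_map)
print(text.split(split))   # ['ab', 'cdefghijklmnopqrstuvwxyz']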

Therefore, in order to split only on the specified split string, simply pass an empty string to the filters argument, as in the sketch below.
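
A minimal sketch of that fix, reusing the two sentences from the question (the indices depend on word frequencies in the fitted texts, but with these two sentences they should come out as expected):

from tensorflow.keras.preprocessing.text import Tokenizer

sample_text = ["This is a sample sentence1 created by sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ",
               "This is another sample sentence1 created by another sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]

# filters="" disables the punctuation replacement, so the texts are split on " " only
tokenizer = Tokenizer(num_words=200, split=" ", filters="")
tokenizer.fit_on_texts(sample_text)

print(tokenizer.texts_to_sequences(["sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]))
# [[1, 7, 8]]
print(tokenizer.index_word[8])
# 'ab.cdefghijklmnopqrstuvwxyz'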