Tensorflow text tokenizer incorrect tokenization
I am trying to use the TF Tokenizer for an NLP model:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=200, split=" ")
sample_text = ["This is a sample sentence1 created by sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ",
"This is another sample sentence1 created by another sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]
tokenizer.fit_on_texts(sample_text)
print(tokenizer.texts_to_sequences(["sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]))
Output:
[[1, 7, 8, 9]]
Word_Index:
print(tokenizer.index_word[8]) ===> 'ab'
print(tokenizer.index_word[9]) ===> 'cdefghijklmnopqrstuvwxyz'
The issue is that the tokenizer creates tokens based on . in this case. I pass split=" " to the Tokenizer, so I expect the following output:
[[1,7,8]], where tokenizer.index_word[8] should be 'ab.cdefghijklmnopqrstuvwxyz'
since I want the tokenizer to create words based on the space (" ") only, not on any special characters.
How do I make the tokenizer create tokens on spaces only?
Tokenizer accepts another parameter called filters, which defaults to all ASCII punctuation (filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~\t\n'). During tokenization, every character contained in filters is replaced by the specified split string.
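As a quick sketch of that default behaviour, using the text_to_word_sequence helper that the Tokenizer relies on internally (the expected output below assumes the default filters and lower=True):

from tensorflow.keras.preprocessing.text import text_to_word_sequence

# with the default filters, '.' counts as punctuation and is replaced by
# the split string, so 'AB.CDEF...' breaks into two words
print(text_to_word_sequence("sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"))
# ['sample', 'person', 'ab', 'cdefghijklmnopqrstuvwxyz']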
If you look at the source code of the Tokenizer, specifically the method fit_on_texts that handles the filters characters, you will see it uses the function text_to_word_sequence and passes it the filters together with the split string it also receives:
def text_to_word_sequence(... ):
    ...
    # every character in `filters` is mapped to the `split` string
    translate_dict = {c: split for c in filters}
    translate_map = maketrans(translate_dict)
    text = text.translate(translate_map)
    # the translated text is then split on `split`, dropping empty tokens
    seq = text.split(split)
    return [i for i in seq if i]
So, in order to split only on the specified split string, just pass an empty string to the filters argument.
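A minimal sketch of that fix, reusing the sample data from the question (the exact indices depend on the fitted corpus, but the dotted token should now stay intact):

from tensorflow.keras.preprocessing.text import Tokenizer

# filters='' disables the punctuation replacement, so the text is split
# on the space character only
tokenizer = Tokenizer(num_words=200, split=" ", filters='')
sample_text = ["This is a sample sentence1 created by sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ",
               "This is another sample sentence1 created by another sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]
tokenizer.fit_on_texts(sample_text)
print(tokenizer.texts_to_sequences(["sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]))
# expected to print something like [[1, 7, 8]], with
# tokenizer.index_word[8] == 'ab.cdefghijklmnopqrstuvwxyz'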