Add some custom words to tokenizer in Spacy
I have a sentence; the tokenizer currently produces the tokens shown below, but I would like to get the expected tokens instead.
Sentence: "[x] works for [y] in [z]."
Tokens: ["[", "x", "]", "works", "for", "[", "y", "]", "in", "[", "z", "]", "."]
Expected: ["[x]", "works", "for", "[y]", "in", "[z]", "."]
How can I do this with a custom tokenizer function?
You can remove [ and ] from the tokenizer's prefixes and suffixes so that the brackets are not split off from the adjacent tokens:
import spacy

nlp = spacy.load('en_core_web_sm')

# Rebuild the prefix regex without the escaped '[' pattern
prefixes = list(nlp.Defaults.prefixes)
prefixes.remove('\\[')
prefix_regex = spacy.util.compile_prefix_regex(prefixes)
nlp.tokenizer.prefix_search = prefix_regex.search

# Rebuild the suffix regex without the escaped ']' pattern
suffixes = list(nlp.Defaults.suffixes)
suffixes.remove('\\]')
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search

doc = nlp("[x] works for [y] in [z].")
print([t.text for t in doc])
# ['[x]', 'works', 'for', '[y]', 'in', '[z]', '.']
The relevant documentation is here:
https://spacy.io/usage/linguistic-features#native-tokenizer-additions
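If you only need a few specific bracketed strings to stay together (rather than changing how all brackets are handled), another option is the tokenizer's special-case rules. The sketch below is an alternative under assumptions, not the approach from the answer above: it hardcodes the placeholder strings "[x]", "[y]" and "[z]" from the question, and how a special case interacts with adjacent punctuation (the trailing "." in "[z].") can depend on the spaCy version, so verify the output on your own data.

import spacy
from spacy.symbols import ORTH

nlp = spacy.load('en_core_web_sm')

# Register each bracketed placeholder as a single-token special case.
# Assumption: the exact strings ("[x]", "[y]", "[z]") are known ahead of time.
for word in ["[x]", "[y]", "[z]"]:
    nlp.tokenizer.add_special_case(word, [{ORTH: word}])

doc = nlp("[x] works for [y] in [z].")
print([t.text for t in doc])

Unlike the prefix/suffix change, this leaves the default bracket splitting intact for every other token, which may be preferable if brackets should normally be split.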