spaCy tokenization merges the wrong tokens
I want to use spaCy to tokenize Wikipedia content. Ideally it would work like this:
text = 'procedure that arbitrates competing models or hypotheses.[2][3] Researchers also use experimentation to test existing theories or new hypotheses to support or disprove them.[3][4]'
# run spacy
spacy_en = spacy.load("en")
doc = spacy_en(text, disable=['tagger', 'ner'])
tokens = [tok.text.lower() for tok in doc]
# desired output
# tokens = [..., 'models', 'or', 'hypotheses', '.', '[2][3]', 'Researchers', ...]
# actual output
# tokens = [..., 'models', 'or', 'hypotheses.[2][3', ']', 'Researchers', ...]
The problem is that 'hypotheses.[2][3]' sticks together as a single token.
How can I prevent spaCy from attaching this '[2][3]' to the preceding token?
As long as it is separated from the word hypotheses and from the period at the end of the sentence, I don't care how it is handled. But individual words and grammar should be kept clear of this kind of syntactic noise.
So, for example, any of the following would be a fine output:
- 'hypotheses', '.', '[2][', '3]'
- 'hypotheses', '.', '[2', '][3]'
I think you could try infixes:
import re
import spacy
from spacy.tokenizer import Tokenizer

# Treat '.' as an infix so that 'hypothesis.[2][3]' is split around the period
infix_re = re.compile(r'''[.]''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)

nlp = spacy.load('en')
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp(u"hello-world! I am hypothesis.[2][3]")
print([t.text for t in doc])
More about this here: https://spacy.io/usage/linguistic-features#native-tokenizers
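One caveat with the snippet above: constructing the Tokenizer with only infix_finditer drops spaCy's default prefix, suffix and exception rules, so other punctuation (for example the '!' in 'hello-world!') is no longer split off. A less invasive variant keeps the defaults and just adds one extra infix rule that splits a period immediately followed by an opening bracket. This is a sketch assuming spaCy 2.x and the 'en' model; the pattern r'\.(?=\[)' is my own choice, not part of the answer above:

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex

def custom_tokenizer(nlp):
    # Reuse the language defaults and append one extra infix rule:
    # split on a '.' that is immediately followed by '[' (hypothetical pattern).
    infixes = list(nlp.Defaults.infixes) + [r'\.(?=\[)']
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    infix_re = compile_infix_regex(infixes)
    return Tokenizer(nlp.vocab, nlp.Defaults.tokenizer_exceptions,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=nlp.Defaults.token_match)

nlp = spacy.load('en')
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp(u"procedure that arbitrates competing models or hypotheses.[2][3] Researchers also use experimentation.")
print([t.text for t in doc])

On the question's text this should come out along the lines of ..., 'hypotheses', '.', '[2][3', ']', 'Researchers', ..., i.e. the word and the sentence-final period are separated from the citation brackets, which is what the question asks for.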