How can I get Spacy to stop splitting both hyphenated numbers and words into separate tokens?
Thanks for looking. I'm using spaCy to perform named entity recognition on a block of text, but I'm running into a particular problem I can't seem to overcome. Here's some sample code:
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load("en_core_web_sm")
doc = nlp('The Indo-European Caucus won the all-male election 58-32.')
print([token.text for token in doc])
The result looks like this:
['The', 'Indo', '-', 'European', 'Caucus', 'won', 'the', 'all', '-', 'male', 'election', '58', '-', '32', '.']
My problem is that I need those hyphenated words and numbers to come through as single tokens. I followed the example given in this answer, using the following code:
from spacy.util import compile_infix_regex

inf = list(nlp.Defaults.infixes)
inf = [x for x in inf if '-|–|—|--|---|——|~' not in x]  # remove the hyphen-between-letters pattern from the infix patterns
infix_re = compile_infix_regex(tuple(inf))

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, prefix_search=nlp.tokenizer.prefix_search,
                     suffix_search=nlp.tokenizer.suffix_search,
                     infix_finditer=infix_re.finditer,
                     token_match=nlp.tokenizer.token_match,
                     rules=nlp.Defaults.tokenizer_exceptions)

nlp.tokenizer = custom_tokenizer(nlp)
That helped with the alphabetic characters, and I got this:
['The', 'Indo-European', 'Caucus', 'won', 'the', 'all-male', 'election', '58', '-', '32', '.']
That's much better, but '58-32' is still being split into separate tokens. I tried this answer and got the opposite effect:
['The', 'Indo', '-', 'European', 'Caucus', 'won', 'the', 'all', '-', 'male', 'election', '58-32', '.']
How can I change the tokenizer so that it gives me the right result in both cases?
You can combine both solutions:
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")

def custom_tokenizer(nlp):
    inf = list(nlp.Defaults.infixes)  # Default infixes
    inf.remove(r"(?<=[0-9])[+\-\*^](?=[0-9-])")  # Remove the generic op between numbers or between a number and a -
    inf = tuple(inf)  # Convert inf to tuple
    infixes = inf + tuple([r"(?<=[0-9])[+*^](?=[0-9-])", r"(?<=[0-9])-(?=-)"])  # Add the rule back without the (?<=[0-9])-(?=[0-9]) case
    infixes = [x for x in infixes if '-|–|—|--|---|——|~' not in x]  # Remove the - between letters rule
    infix_re = compile_infix_regex(infixes)
    return Tokenizer(nlp.vocab, prefix_search=nlp.tokenizer.prefix_search,
                     suffix_search=nlp.tokenizer.suffix_search,
                     infix_finditer=infix_re.finditer,
                     token_match=nlp.tokenizer.token_match,
                     rules=nlp.Defaults.tokenizer_exceptions)

nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp('The Indo-European Caucus won the all-male election 58-32.')
print([token.text for token in doc])
Output:
['The', 'Indo-European', 'Caucus', 'won', 'the', 'all-male', 'election', '58-32', '.']
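To see why this works, here is a minimal sketch using plain re (outside spaCy) to check which of the patterns above match the hyphen in '58-32'. The names default_op and replacements are just for illustration; the pattern strings are the same ones manipulated in the answer:

import re

# The default infix rule treats +, -, * and ^ between digits as split points,
# so it matches the hyphen inside '58-32'.
default_op = r"(?<=[0-9])[+\-\*^](?=[0-9-])"

# The two replacement rules drop the plain digit-hyphen-digit case.
replacements = [r"(?<=[0-9])[+*^](?=[0-9-])", r"(?<=[0-9])-(?=-)"]

print(re.search(default_op, "58-32"))                 # matches -> '58-32' would be split
print([re.search(p, "58-32") for p in replacements])  # [None, None] -> stays one token
print(re.search(replacements[0], "5+3"))              # matches -> '5+3' is still split

In other words, the answer removes only the rule that splits a hyphen sitting directly between two digits; +, * and ^ remain split points, and (?<=[0-9])-(?=-) still splits double hyphens such as '58--32'.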