SpaCy -- intra-word hyphens. How to treat them as one word?

The following is the code given as an answer to ;
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex
import re

nlp = spacy.load('en')
infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")
infix_re = spacy.util.compile_infix_regex(infixes)

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)

nlp.tokenizer = custom_tokenizer(nlp)

s1 = "Marketing-Representative- won't die in car accident."
s2 = "Out-of-box implementation"

for s in s1, s2:
    doc = nlp("{}".format(s))
    print([token.text for token in doc])
Result:
$python3 /tmp/nlp.py
['Marketing-Representative-', 'wo', "n't", 'die', 'in', 'car', 'accident', '.']
['Out-of-box', 'implementation']
What are the first (r"[./]") and the last (r"(.'.)") patterns below used for?
infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")
EDIT: I want the text to be split like this;

That
is
Yahya
's
laptop-cover
.

I want spaCy to treat hyphenated words as one token, without negatively affecting the other splitting rules.

"That is Yahya's laptop-cover. 3.14!"
["That", "is", "Yahya", "'s", "laptop-cover", ".", "3.14", "!"] (expected)
By default,
import spacy
nlp = spacy.load('en_core_web_md')
for token in nlp("That is Yahya's laptop-cover. 3.14!"):
    print(token.text)
SpaCy gives;
["That", "is", "Yahya", "'s", "laptop", "-", "cover", ".", "3.14", "!"]
然而,
from spacy.util import compile_infix_regex
infixes = nlp.Defaults.prefixes + tuple([r"[-]~"])
infix_re = spacy.util.compile_infix_regex(infixes)
nlp.tokenizer = spacy.tokenizer.Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)
for token in nlp("That is Yahya's laptop-cover. 3.14!"):
    print(token.text)
gives;
["That", "is", "Yahya", "'", "s", "laptop-cover.", "3.14", "!"]
NOTE: To see the custom tokenizer that keeps hyphenated words intact, see the bottom of the answer.
Here, a custom tokenizer is defined that uses a set of built-in (nlp.Defaults.prefixes) and custom ([./], [-]~, (.'.)) patterns.
nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")
is a tuple concatenation operation, and the result looks like
('§', '%', '=', '—', '–', '\+(?![0-9])', '…', '……', ',', ':', ';', '\!', '\?', '¿', '؟', '¡', '\(', '\)', '\[', '\]', '\{', '\}', '<', '>', '_', '#', '\*', '&', '。', '?', '!', ',', '、', ';', ':', '~', '·', '।', '،', '؛', '٪', '\.\.+', '…', "\'", '"', '”', '“', '`', '‘', '´', '’', '‚', ',', '„', '»', '«', '「', '」', '『', '』', '(', ')', '〔', '〕', '【', '】', '《', '》', '〈', '〉', '\$', '£', '€', '¥', '฿', 'US\$', 'C\$', 'A\$', '₽', '﷼', '₴', '[\u00A6\u00A9\u00AE\u00B0\u0482\u058D\u058E\u060E\u060F\u06DE\u06E9\u06FD\u06FE\u07F6\u09FA\u0B70\u0BF3-\u0BF8\u0BFA\u0C7F\u0D4F\u0D79\u0F01-\u0F03\u0F13\u0F15-\u0F17\u0F1A-\u0F1F\u0F34\u0F36\u0F38\u0FBE-\u0FC5\u0FC7-\u0FCC\u0FCE\u0FCF\u0FD5-\u0FD8\u109E\u109F\u1390-\u1399\u1940\u19DE-\u19FF\u1B61-\u1B6A\u1B74-\u1B7C\u2100\u2101\u2103-\u2106\u2108\u2109\u2114\u2116\u2117\u211E-\u2123\u2125\u2127\u2129\u212E\u213A\u213B\u214A\u214C\u214D\u214F\u218A\u218B\u2195-\u2199\u219C-\u219F\u21A1\u21A2\u21A4\u21A5\u21A7-\u21AD\u21AF-\u21CD\u21D0\u21D1\u21D3\u21D5-\u21F3\u2300-\u2307\u230C-\u231F\u2322-\u2328\u232B-\u237B\u237D-\u239A\u23B4-\u23DB\u23E2-\u2426\u2440-\u244A\u249C-\u24E9\u2500-\u25B6\u25B8-\u25C0\u25C2-\u25F7\u2600-\u266E\u2670-\u2767\u2794-\u27BF\u2800-\u28FF\u2B00-\u2B2F\u2B45\u2B46\u2B4D-\u2B73\u2B76-\u2B95\u2B98-\u2BC8\u2BCA-\u2BFE\u2CE5-\u2CEA\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u2FF0-\u2FFB\u3004\u3012\u3013\u3020\u3036\u3037\u303E\u303F\u3190\u3191\u3196-\u319F\u31C0-\u31E3\u3200-\u321E\u322A-\u3247\u3250\u3260-\u327F\u328A-\u32B0\u32C0-\u32FE\u3300-\u33FF\u4DC0-\u4DFF\uA490-\uA4C6\uA828-\uA82B\uA836\uA837\uA839\uAA77-\uAA79\uFDFD\uFFE4\uFFE8\uFFED\uFFEE\uFFFC\uFFFD\U00010137-\U0001013F\U00010179-\U00010189\U0001018C-\U0001018E\U00010190-\U0001019B\U000101A0\U000101D0-\U000101FC\U00010877\U00010878\U00010AC8\U0001173F\U00016B3C-\U00016B3F\U00016B45\U0001BC9C\U0001D000-\U0001D0F5\U0001D100-\U0001D126\U0001D129-\U0001D164\U0001D16A-\U0001D16C\U0001D183\U0001D184\U0001D18C-\U0001D1A9\U0001D1AE-\U0001D1E8\U0001D200-\U0001D241\U0001D245\U0001D300-\U0001D356\U0001D800-\U0001D9FF\U0001DA37-\U0001DA3A\U0001DA6D-\U0001DA74\U0001DA76-\U0001DA83\U0001DA85\U0001DA86\U0001ECAC\U0001F000-\U0001F02B\U0001F030-\U0001F093\U0001F0A0-\U0001F0AE\U0001F0B1-\U0001F0BF\U0001F0C1-\U0001F0CF\U0001F0D1-\U0001F0F5\U0001F110-\U0001F16B\U0001F170-\U0001F1AC\U0001F1E6-\U0001F202\U0001F210-\U0001F23B\U0001F240-\U0001F248\U0001F250\U0001F251\U0001F260-\U0001F265\U0001F300-\U0001F3FA\U0001F400-\U0001F6D4\U0001F6E0-\U0001F6EC\U0001F6F0-\U0001F6F9\U0001F700-\U0001F773\U0001F780-\U0001F7D8\U0001F800-\U0001F80B\U0001F810-\U0001F847\U0001F850-\U0001F859\U0001F860-\U0001F887\U0001F890-\U0001F8AD\U0001F900-\U0001F90B\U0001F910-\U0001F93E\U0001F940-\U0001F970\U0001F973-\U0001F976\U0001F97A\U0001F97C-\U0001F9A2\U0001F9B0-\U0001F9B9\U0001F9C0-\U0001F9C2\U0001F9D0-\U0001F9FF\U0001FA60-\U0001FA6D]', '[/.]', '-~', "(.'.)")
As you can see, these are all regular expressions used to handle in-word punctuation, i.e. infixes. See the spaCy tokenizer algorithm:
The algorithm can be summarized as follows:
- Iterate over space-separated substrings
- Check whether we have an explicitly defined rule for this substring. If we do, use it.
- Otherwise, try to consume a prefix.
- If we consumed a prefix, go back to the beginning of the loop, so that special-cases always get priority.
- If we didn’t consume a prefix, try to consume a suffix.
- If we can’t consume a prefix or suffix, look for “infixes” — stuff like hyphens etc.
- Once we can’t consume any more of the string, handle it as a single token.
Now, when we are at the infix handling step, these regexes are used to split the text into tokens according to these patterns.
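To make the order of operations concrete, here is a rough, simplified sketch of that loop. It is NOT spaCy's actual implementation (it ignores token_match, URL matching, whitespace offsets, etc.), and the sketch_tokenize name and the toy regexes are purely illustrative:

import re

# Rough sketch of the loop quoted above: consume prefixes from the left,
# suffixes from the right, then split whatever remains on infix matches.
def sketch_tokenize(text, prefix_re, suffix_re, infix_re, special_cases=None):
    special_cases = special_cases or {}
    tokens = []
    for substring in text.split():
        suffixes = []
        while substring:
            if substring in special_cases:            # explicit rule wins
                tokens.extend(special_cases[substring])
                substring = ""
                break
            pre = prefix_re.search(substring)
            if pre and pre.start() == 0 and pre.end() > 0:
                tokens.append(substring[:pre.end()])  # consume a prefix
                substring = substring[pre.end():]
                continue
            suf = suffix_re.search(substring)
            if suf and suf.end() == len(substring) and suf.start() < len(substring):
                suffixes.append(substring[suf.start():])  # consume a suffix
                substring = substring[:suf.start()]
                continue
            last = 0                                  # no prefix/suffix left:
            for m in infix_re.finditer(substring):    # split on infixes
                if m.start() > last:
                    tokens.append(substring[last:m.start()])
                if m.end() > m.start():
                    tokens.append(m.group())
                last = m.end()
            if substring[last:]:
                tokens.append(substring[last:])
            substring = ""
        tokens.extend(reversed(suffixes))
    return tokens

# With [-]~ as the only infix, "Out-of-box" stays in one piece because the
# pattern never matches a lone hyphen:
print(sketch_tokenize('"Out-of-box" implementation!',
                      prefix_re=re.compile(r'^"'),
                      suffix_re=re.compile(r'["!]$'),
                      infix_re=re.compile(r"[-]~")))
# => ['"', 'Out-of-box', '"', 'implementation', '!']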
For example, [/.] is important because without it abc.def/ghi would be a single token, but with the pattern added it gets split into 'abc', '.', 'def', '/', 'ghi'.
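As a quick check (assuming the custom tokenizer defined in the code at the top has already been assigned to nlp.tokenizer):

doc = nlp("abc.def/ghi")
print([token.text for token in doc])
# => ['abc', '.', 'def', '/', 'ghi']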
[-]~ (the same as -~) matches a - and then expects a ~ right after it. Since there is none, the - is skipped, no split happens, and you get the whole 'Marketing-Representative-' token. Note, however, that if the sentence contained 'Marketing-~Representative-' and the -~ regex were used, you would get ['Marketing', '-~', 'Representative-'] as a result, because there would be a match.
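A plain re check of that pattern (illustration only, not part of the original answer):

import re

print(re.findall(r"[-]~", "Marketing-Representative-"))    # => []      no match, so no split
print(re.findall(r"[-]~", "Marketing-~Representative-"))   # => ['-~']  a match, so a split happens here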
The .'. regex matches any character + ' + any character (a dot matches any character in a regex). So this rule simply tokenizes (splits out) such chunks from the sentence (e.g. n't, 'd, etc.).
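Again, a plain re illustration (not from the original answer):

import re

print(re.findall(r"(.'.)", "won't"))        # => ["n't"]
print(re.findall(r"(.'.)", "Yahya's car"))  # => ["a's"]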
EDIT:
You should be very careful when adding new rules and check whether they overlap with the rules that are already there. E.g. when you add r"\b's\b" to split off the genitive apostrophe-s, you should "override" the "\'" rule coming from nlp.Defaults.prefixes. Either remove it if you do not plan to match ' as an infix, or give your custom rules priority by appending nlp.Defaults.prefixes after them (or vice versa).
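A minimal re-only illustration of why that ordering matters (the two-pattern alternations here are just for demonstration):

import re

# The leftmost alternative that matches at a given position wins:
print(re.findall(r"\'|'s\b", "Yahya's"))   # => ["'"]   the default quote rule fires first
print(re.findall(r"'s\b|\'", "Yahya's"))   # => ["'s"]  the custom rule fires first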
See the sample code:
import re
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load("en_core_web_md")

infixes = tuple([r"'s\b", r"(?<!\d)\.(?!\d)"]) + nlp.Defaults.prefixes
infix_re = spacy.util.compile_infix_regex(infixes)

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)

nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp(u"That is Yahya's laptop-cover. 3.14!")
print([t.text for t in doc])
Output: ['That', 'is', 'Yahya', "'s", 'laptop-cover', '.', '3.14', '!']
Details:

r"'s\b" - matches 's followed by a word boundary
r"(?<!\d)\.(?!\d)" - matches a . that is neither preceded nor followed by a digit.
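A quick re check of these two patterns (illustration only):

import re

print(re.findall(r"'s\b", "Yahya's laptop-cover"))      # => ["'s"]
print(re.findall(r"(?<!\d)\.(?!\d)", "cover. 3.14!"))   # => ['.']  the dot inside 3.14 is left alone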
AND if you want to keep the hyphenated words as single tokens with a custom tokenizer, you will have to redefine the infixes: the

r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),

line is what splits them, and you need to get rid of it. Since it is the only item containing the -|–|—|--|---|——|~ string, it is easier to remove that item from infixes and recompile the infix patterns:
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")

inf = list(nlp.Defaults.infixes)
inf = [x for x in inf if '-|–|—|--|---|——|~' not in x]  # remove the hyphen-between-letters pattern from the infix patterns
infix_re = compile_infix_regex(tuple(inf))

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab,
                     prefix_search=nlp.tokenizer.prefix_search,
                     suffix_search=nlp.tokenizer.suffix_search,
                     infix_finditer=infix_re.finditer,
                     token_match=nlp.tokenizer.token_match,
                     rules=nlp.Defaults.tokenizer_exceptions)

nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp("That is Yahya's laptop-cover. 3.14!")
print([t.text for t in doc])
# => ['That', 'is', 'Yahya', "'s", 'laptop-cover', '.', '3.14', '!']