SpaCy -- intra-word hyphens. How to treat them as one word?

The following is the code given as an answer to ;
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex
import re

nlp = spacy.load('en')
infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")
infix_re = spacy.util.compile_infix_regex(infixes)

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)

nlp.tokenizer = custom_tokenizer(nlp)

s1 = "Marketing-Representative- won't die in car accident."
s2 = "Out-of-box implementation"

for s in s1, s2:
    doc = nlp("{}".format(s))
    print([token.text for token in doc])
Result:
$python3 /tmp/nlp.py
['Marketing-Representative-', 'wo', "n't", 'die', 'in', 'car', 'accident', '.']
['Out-of-box', 'implementation']
What are the first (r"[./]") and the last (r"(.'.)") patterns below used for?
infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")
EDIT: I want the text to be split like this;

That
is
Yahya
's
laptop-cover
.

I want spaCy to treat hyphenated words as one token, without negatively affecting the other splitting rules.

"That is Yahya's laptop-cover. 3.14!"
["That", "is", "Yahya", "'s", "laptop-cover", ".", "3.14", "!"] (expected)
By default,
import spacy
nlp = spacy.load('en_core_web_md')
for token in nlp("That is Yahya's laptop-cover. 3.14!"):
    print(token.text)
SpaCy gives;
["That", "is", "Yahya", "'s", "laptop", "-", "cover", ".", "3.14", "!"]
然而,
from spacy.util import compile_infix_regex
infixes = nlp.Defaults.prefixes + tuple([r"[-]~"])
infix_re = spacy.util.compile_infix_regex(infixes)
nlp.tokenizer = spacy.tokenizer.Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)
for token in nlp("That is Yahya's laptop-cover. 3.14!"):
    print(token.text)
gives;
["That", "is", "Yahya", "'", "s", "laptop-cover.", "3.14", "!"]
NOTE: To see the custom tokenizer that keeps hyphenated words intact, see the bottom of the answer.
Here, a custom tokenizer is defined that uses a set of built-in (nlp.Defaults.prefixes) and custom ([./], [-]~, (.'.)) patterns.
nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")
is a tuple concatenation operation, and the result looks like
('§', '%', '=', '—', '–', '\+(?![0-9])', '…', '……', ',', ':', ';', '\!', '\?', '¿', '؟', '¡', '\(', '\)', '\[', '\]', '\{', '\}', '<', '>', '_', '#', '\*', '&', '。', '?', '!', ',', '、', ';', ':', '~', '·', '।', '،', '؛', '٪', '\.\.+', '…', "\'", '"', '”', '“', '`', '‘', '´', '’', '‚', ',', '„', '»', '«', '「', '」', '『', '』', '(', ')', '〔', '〕', '【', '】', '《', '》', '〈', '〉', '\$', '£', '€', '¥', '฿', 'US\$', 'C\$', 'A\$', '₽', '﷼', '₴', '[\u00A6\u00A9\u00AE\u00B0\u0482\u058D\u058E\u060E\u060F\u06DE\u06E9\u06FD\u06FE\u07F6\u09FA\u0B70\u0BF3-\u0BF8\u0BFA\u0C7F\u0D4F\u0D79\u0F01-\u0F03\u0F13\u0F15-\u0F17\u0F1A-\u0F1F\u0F34\u0F36\u0F38\u0FBE-\u0FC5\u0FC7-\u0FCC\u0FCE\u0FCF\u0FD5-\u0FD8\u109E\u109F\u1390-\u1399\u1940\u19DE-\u19FF\u1B61-\u1B6A\u1B74-\u1B7C\u2100\u2101\u2103-\u2106\u2108\u2109\u2114\u2116\u2117\u211E-\u2123\u2125\u2127\u2129\u212E\u213A\u213B\u214A\u214C\u214D\u214F\u218A\u218B\u2195-\u2199\u219C-\u219F\u21A1\u21A2\u21A4\u21A5\u21A7-\u21AD\u21AF-\u21CD\u21D0\u21D1\u21D3\u21D5-\u21F3\u2300-\u2307\u230C-\u231F\u2322-\u2328\u232B-\u237B\u237D-\u239A\u23B4-\u23DB\u23E2-\u2426\u2440-\u244A\u249C-\u24E9\u2500-\u25B6\u25B8-\u25C0\u25C2-\u25F7\u2600-\u266E\u2670-\u2767\u2794-\u27BF\u2800-\u28FF\u2B00-\u2B2F\u2B45\u2B46\u2B4D-\u2B73\u2B76-\u2B95\u2B98-\u2BC8\u2BCA-\u2BFE\u2CE5-\u2CEA\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u2FF0-\u2FFB\u3004\u3012\u3013\u3020\u3036\u3037\u303E\u303F\u3190\u3191\u3196-\u319F\u31C0-\u31E3\u3200-\u321E\u322A-\u3247\u3250\u3260-\u327F\u328A-\u32B0\u32C0-\u32FE\u3300-\u33FF\u4DC0-\u4DFF\uA490-\uA4C6\uA828-\uA82B\uA836\uA837\uA839\uAA77-\uAA79\uFDFD\uFFE4\uFFE8\uFFED\uFFEE\uFFFC\uFFFD\U00010137-\U0001013F\U00010179-\U00010189\U0001018C-\U0001018E\U00010190-\U0001019B\U000101A0\U000101D0-\U000101FC\U00010877\U00010878\U00010AC8\U0001173F\U00016B3C-\U00016B3F\U00016B45\U0001BC9C\U0001D000-\U0001D0F5\U0001D100-\U0001D126\U0001D129-\U0001D164\U0001D16A-\U0001D16C\U0001D183\U0001D184\U0001D18C-\U0001D1A9\U0001D1AE-\U0001D1E8\U0001D200-\U0001D241\U0001D245\U0001D300-\U0001D356\U0001D800-\U0001D9FF\U0001DA37-\U0001DA3A\U0001DA6D-\U0001DA74\U0001DA76-\U0001DA83\U0001DA85\U0001DA86\U0001ECAC\U0001F000-\U0001F02B\U0001F030-\U0001F093\U0001F0A0-\U0001F0AE\U0001F0B1-\U0001F0BF\U0001F0C1-\U0001F0CF\U0001F0D1-\U0001F0F5\U0001F110-\U0001F16B\U0001F170-\U0001F1AC\U0001F1E6-\U0001F202\U0001F210-\U0001F23B\U0001F240-\U0001F248\U0001F250\U0001F251\U0001F260-\U0001F265\U0001F300-\U0001F3FA\U0001F400-\U0001F6D4\U0001F6E0-\U0001F6EC\U0001F6F0-\U0001F6F9\U0001F700-\U0001F773\U0001F780-\U0001F7D8\U0001F800-\U0001F80B\U0001F810-\U0001F847\U0001F850-\U0001F859\U0001F860-\U0001F887\U0001F890-\U0001F8AD\U0001F900-\U0001F90B\U0001F910-\U0001F93E\U0001F940-\U0001F970\U0001F973-\U0001F976\U0001F97A\U0001F97C-\U0001F9A2\U0001F9B0-\U0001F9B9\U0001F9C0-\U0001F9C2\U0001F9D0-\U0001F9FF\U0001FA60-\U0001FA6D]', '[/.]', '-~', "(.'.)")
As you can see, these are all regular expressions used to handle in-word punctuation, i.e. infixes. See the spaCy tokenizer algorithm:
The algorithm can be summarized as follows:
- Iterate over space-separated substrings
- Check whether we have an explicitly defined rule for this substring. If we do, use it.
- Otherwise, try to consume a prefix.
- If we consumed a prefix, go back to the beginning of the loop, so that special-cases always get priority.
- If we didn’t consume a prefix, try to consume a suffix.
- If we can’t consume a prefix or suffix, look for “infixes” — stuff like hyphens etc.
- Once we can’t consume any more of the string, handle it as a single token.
Now, when we are at the infix handling step, these regexes are used to split the text into tokens according to these patterns.
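To make the order of operations concrete, here is a rough, simplified sketch of that loop. It is NOT spaCy's actual implementation (it ignores token_match, URL matching, whitespace offsets, etc.), and the sketch_tokenize name and the toy regexes are purely illustrative:

import re

# Rough sketch of the loop quoted above: consume prefixes from the left,
# suffixes from the right, then split whatever remains on infix matches.
def sketch_tokenize(text, prefix_re, suffix_re, infix_re, special_cases=None):
    special_cases = special_cases or {}
    tokens = []
    for substring in text.split():
        suffixes = []
        while substring:
            if substring in special_cases:            # explicit rule wins
                tokens.extend(special_cases[substring])
                substring = ""
                break
            pre = prefix_re.search(substring)
            if pre and pre.start() == 0 and pre.end() > 0:
                tokens.append(substring[:pre.end()])  # consume a prefix
                substring = substring[pre.end():]
                continue
            suf = suffix_re.search(substring)
            if suf and suf.end() == len(substring) and suf.start() < len(substring):
                suffixes.append(substring[suf.start():])  # consume a suffix
                substring = substring[:suf.start()]
                continue
            last = 0                                  # no prefix/suffix left:
            for m in infix_re.finditer(substring):    # split on infixes
                if m.start() > last:
                    tokens.append(substring[last:m.start()])
                if m.end() > m.start():
                    tokens.append(m.group())
                last = m.end()
            if substring[last:]:
                tokens.append(substring[last:])
            substring = ""
        tokens.extend(reversed(suffixes))
    return tokens

# With [-]~ as the only infix, "Out-of-box" stays in one piece because the
# pattern never matches a lone hyphen:
print(sketch_tokenize('"Out-of-box" implementation!',
                      prefix_re=re.compile(r'^"'),
                      suffix_re=re.compile(r'["!]$'),
                      infix_re=re.compile(r"[-]~")))
# => ['"', 'Out-of-box', '"', 'implementation', '!']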
For example, [/.] is important because without it abc.def/ghi would be a single token, but with the pattern added it gets split into 'abc', '.', 'def', '/', 'ghi'.
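As a quick check (assuming the custom tokenizer defined in the code at the top has already been assigned to nlp.tokenizer):

doc = nlp("abc.def/ghi")
print([token.text for token in doc])
# => ['abc', '.', 'def', '/', 'ghi']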
[-]~ (the same as -~) matches a - and then expects a ~ right after it. Since there is none, the - is skipped, no split happens, and you get the whole 'Marketing-Representative-' token. Note, however, that if the sentence contained 'Marketing-~Representative-' and the -~ regex were used, you would get ['Marketing', '-~', 'Representative-'] as a result, because there would be a match.
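A plain re check of that pattern (illustration only, not part of the original answer):

import re

print(re.findall(r"[-]~", "Marketing-Representative-"))    # => []      no match, so no split
print(re.findall(r"[-]~", "Marketing-~Representative-"))   # => ['-~']  a match, so a split happens here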
The .'. regex matches any character + ' + any character (a dot matches any character in a regex). So this rule simply tokenizes (splits out) such chunks from the sentence (e.g. n't, 'd, etc.).
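Again, a plain re illustration (not from the original answer):

import re

print(re.findall(r"(.'.)", "won't"))        # => ["n't"]
print(re.findall(r"(.'.)", "Yahya's car"))  # => ["a's"]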
EDIT:
You should be very careful when adding new rules and check whether they overlap with the rules that are already there. E.g. when you add r"\b's\b" to split off the genitive apostrophe-s, you should "override" the "\'" rule coming from nlp.Defaults.prefixes. Either remove it if you do not plan to match ' as an infix, or give your custom rules priority by appending nlp.Defaults.prefixes after them (or vice versa).
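A minimal re-only illustration of why that ordering matters (the two-pattern alternations here are just for demonstration):

import re

# The leftmost alternative that matches at a given position wins:
print(re.findall(r"\'|'s\b", "Yahya's"))   # => ["'"]   the default quote rule fires first
print(re.findall(r"'s\b|\'", "Yahya's"))   # => ["'s"]  the custom rule fires first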
See the sample code:
import re
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load("en_core_web_md")

infixes = tuple([r"'s\b", r"(?<!\d)\.(?!\d)"]) + nlp.Defaults.prefixes
infix_re = spacy.util.compile_infix_regex(infixes)

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)

nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp(u"That is Yahya's laptop-cover. 3.14!")
print([t.text for t in doc])
Output: ['That', 'is', 'Yahya', "'s", 'laptop-cover', '.', '3.14', '!']
Details:

r"'s\b" - matches 's followed by a word boundary
r"(?<!\d)\.(?!\d)" - matches a . that is neither preceded nor followed by a digit.
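A quick re check of these two patterns (illustration only):

import re

print(re.findall(r"'s\b", "Yahya's laptop-cover"))      # => ["'s"]
print(re.findall(r"(?<!\d)\.(?!\d)", "cover. 3.14!"))   # => ['.']  the dot inside 3.14 is left alone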
AND if you want to keep the hyphenated words as single tokens with a custom tokenizer, you will have to redefine the infixes: the

r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),

line is what splits them, and you need to get rid of it. Since it is the only item containing the -|–|—|--|---|——|~ string, it is easier to remove that item from infixes and recompile the infix patterns:
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")

inf = list(nlp.Defaults.infixes)
inf = [x for x in inf if '-|–|—|--|---|——|~' not in x]  # remove the hyphen-between-letters pattern from the infix patterns
infix_re = compile_infix_regex(tuple(inf))

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab,
                     prefix_search=nlp.tokenizer.prefix_search,
                     suffix_search=nlp.tokenizer.suffix_search,
                     infix_finditer=infix_re.finditer,
                     token_match=nlp.tokenizer.token_match,
                     rules=nlp.Defaults.tokenizer_exceptions)

nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp("That is Yahya's laptop-cover. 3.14!")
print([t.text for t in doc])
# => ['That', 'is', 'Yahya', "'s", 'laptop-cover', '.', '3.14', '!']