How to find matches faster with SpaCy Matcher?

I'm trying to use the spaCy Matcher to detect whether a sentence contains a passive-voice match. I wrote the patterns below, and they correctly find passive verbs and sentences. My problem now is speed: I have about 1 million records, each with roughly 10 sentences. Is there anything I can do to make the search more efficient? For example, not returning the end and start tokens?

The matcher:

matcher = Matcher(nlp.vocab)
passive_rule1 = [{'DEP':'nsubjpass', 'OP':'*'}, {'DEP':'xcomp', 'OP':'*'}, {'DEP':'aux','OP':'*'},{'DEP':'auxpass'}, {'DEP':'nsubj', 'OP':'*'}, {'TAG':'VBN'}]
passive_rule2 = [{'DEP': 'attr'}, {'DEP':'det', 'OP':'*'}, {'POS':'NOUN', 'OP': '?'}, {'TAG':'VBN'}]

matcher.add('passive_rule1',None, passive_rule1)
matcher.add('passive_rule2', None, passive_rule2)

Finding the matches:

df.loc[:, 'PassiveVoice'] = df.Sentence.apply(lambda x:1 if len(matcher(nlp(x)))>0 else 0)
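On the "not returning the end and start tokens" idea: as far as I know, Matcher always returns a list of (match_id, start, end) tuples and has no option to suppress them, but building that list is cheap — the cost is in nlp(x), not the matcher. Since the result is a plain list, its truthiness can also replace the len(...) > 0 check. A minimal sketch, using a blank tokenizer-only pipeline and a made-up token-text pattern so no model download is needed (note it uses the newer matcher.add(key, [pattern]) signature):

```python
import spacy
from spacy.matcher import Matcher

# Blank pipeline: tokenizer only, no trained model required for this demo.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# Hypothetical pattern on lowercased token text, just to show the output shape.
matcher.add("demo", [[{"LOWER": "was"}, {"LOWER": "built"}]])

doc = nlp("The house was built in 1990.")
matches = matcher(doc)  # list of (match_id, start, end) tuples
print(matches)          # one match spanning tokens 2..4
print(bool(matches))    # True -- truthiness replaces len(...) > 0
```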

Or, if anyone has any other ideas, I'd be happy to hear them!

Putting your 1M texts into a pandas dataframe and then calling nlp on them in a loop, one million times, is a bad idea. Instead, put your documents into a list via df["Sentence"].tolist() and process them efficiently with nlp.pipe():

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_md", disable=["ner"])

matcher = Matcher(nlp.vocab)
passive_rule1 = [
    {"DEP": "nsubjpass", "OP": "*"},
    {"DEP": "xcomp", "OP": "*"},
    {"DEP": "aux", "OP": "*"},
    {"DEP": "auxpass"},
    {"DEP": "nsubj", "OP": "*"},
    {"TAG": "VBN"},
]
passive_rule2 = [
    {"DEP": "attr"},
    {"DEP": "det", "OP": "*"},
    {"POS": "NOUN", "OP": "?"},
    {"TAG": "VBN"},
]

matcher.add("passive_rule1", None, passive_rule1)
matcher.add("passive_rule2", None, passive_rule2)

texts = ["this is my first sentence. about something", "this is another"]
# texts = df["Sentence"].tolist()
docs = nlp.pipe(texts, n_process=2, batch_size=50)

for doc in docs:
    if matcher(doc):
        pass  # do something with the matched doc

Also note that with nlp.pipe() you can turn on multiprocessing with n_process=2 (your choice) and batch your texts with batch_size=50 (your choice).
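Putting it together, the pipe output can be zipped straight back into the dataframe column. A self-contained sketch, with a blank pipeline and a stand-in pattern so it runs without a model — substitute your en_core_web_md pipeline and passive-voice rules:

```python
import pandas as pd
import spacy
from spacy.matcher import Matcher

# Stand-in setup: tokenizer-only pipeline and a simple two-token pattern,
# so the sketch runs without downloading a trained model.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("demo", [[{"LOWER": "was"}, {"LOWER": "built"}]])

df = pd.DataFrame({"Sentence": ["It was built last year.", "It is brand new."]})

# One pass over all texts; add n_process=2 when using a real model-backed pipeline.
docs = nlp.pipe(df["Sentence"].tolist(), batch_size=50)
df["PassiveVoice"] = [1 if matcher(doc) else 0 for doc in docs]
print(df["PassiveVoice"].tolist())  # -> [1, 0]
```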