How to find matches faster with SpaCy Matcher?

I'm trying to use the spaCy Matcher to detect whether a sentence contains a passive-voice match. I wrote the patterns below, and they correctly find passive verbs and sentences. My problem now is speed: I have about 1 million records, each with roughly 10 sentences. Is there anything I can do to make the search more efficient? For example, not returning the end and start tokens?

The matcher:

matcher = Matcher(nlp.vocab)
passive_rule1 = [{'DEP':'nsubjpass', 'OP':'*'}, {'DEP':'xcomp', 'OP':'*'}, {'DEP':'aux','OP':'*'},{'DEP':'auxpass'}, {'DEP':'nsubj', 'OP':'*'}, {'TAG':'VBN'}]
passive_rule2 = [{'DEP': 'attr'}, {'DEP':'det', 'OP':'*'}, {'POS':'NOUN', 'OP': '?'}, {'TAG':'VBN'}]

matcher.add('passive_rule1',None, passive_rule1)
matcher.add('passive_rule2', None, passive_rule2)

Finding the matches:

df.loc[:, 'PassiveVoice'] = df.Sentence.apply(lambda x:1 if len(matcher(nlp(x)))>0 else 0)
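On the "not returning the end and start tokens" idea: as far as I know, Matcher always returns a list of (match_id, start, end) tuples and has no option to suppress them, but building that list is cheap — the cost is in nlp(x), not the matcher. Since the result is a plain list, its truthiness can also replace the len(...) > 0 check. A minimal sketch, using a blank tokenizer-only pipeline and a made-up token-text pattern so no model download is needed (note it uses the newer matcher.add(key, [pattern]) signature):

```python
import spacy
from spacy.matcher import Matcher

# Blank pipeline: tokenizer only, no trained model required for this demo.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# Hypothetical pattern on lowercased token text, just to show the output shape.
matcher.add("demo", [[{"LOWER": "was"}, {"LOWER": "built"}]])

doc = nlp("The house was built in 1990.")
matches = matcher(doc)  # list of (match_id, start, end) tuples
print(matches)          # one match spanning tokens 2..4
print(bool(matches))    # True -- truthiness replaces len(...) > 0
```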

Or, if anyone has any other ideas, I'd be happy to hear them!

Putting your 1M texts into a pandas dataframe and then calling nlp on them in a loop, one million times, is a bad idea. Instead, put your documents into a list via df["Sentence"].tolist() and process them efficiently with nlp.pipe():

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_md", disable=["ner"])

matcher = Matcher(nlp.vocab)
passive_rule1 = [
    {"DEP": "nsubjpass", "OP": "*"},
    {"DEP": "xcomp", "OP": "*"},
    {"DEP": "aux", "OP": "*"},
    {"DEP": "auxpass"},
    {"DEP": "nsubj", "OP": "*"},
    {"TAG": "VBN"},
]
passive_rule2 = [
    {"DEP": "attr"},
    {"DEP": "det", "OP": "*"},
    {"POS": "NOUN", "OP": "?"},
    {"TAG": "VBN"},
]

matcher.add("passive_rule1", None, passive_rule1)
matcher.add("passive_rule2", None, passive_rule2)

texts = ["this is my first sentence. about something", "this is another"]
# texts = df["Sentence"].tolist()
docs = nlp.pipe(texts, n_process=2, batch_size=50)

for doc in docs:
    if matcher(doc):
        pass  # do something with the matched doc

Also note that with nlp.pipe() you can turn on multiprocessing with n_process=2 (your choice) and batch your texts with batch_size=50 (your choice).
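Putting it together, the pipe output can be zipped straight back into the dataframe column. A self-contained sketch, with a blank pipeline and a stand-in pattern so it runs without a model — substitute your en_core_web_md pipeline and passive-voice rules:

```python
import pandas as pd
import spacy
from spacy.matcher import Matcher

# Stand-in setup: tokenizer-only pipeline and a simple two-token pattern,
# so the sketch runs without downloading a trained model.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("demo", [[{"LOWER": "was"}, {"LOWER": "built"}]])

df = pd.DataFrame({"Sentence": ["It was built last year.", "It is brand new."]})

# One pass over all texts; add n_process=2 when using a real model-backed pipeline.
docs = nlp.pipe(df["Sentence"].tolist(), batch_size=50)
df["PassiveVoice"] = [1 if matcher(doc) else 0 for doc in docs]
print(df["PassiveVoice"].tolist())  # -> [1, 0]
```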