How to find matches faster with SpaCy Matcher?
I'm trying to use the spaCy Matcher to detect whether a sentence contains a passive-voice match. I wrote the patterns below, and they correctly find passive verbs and sentences. My problem now is speed: I have about 1 million records, each with roughly 10 sentences. Is there anything I can do to make the search more efficient? For example, not returning the start and end tokens?
The matcher:
matcher = Matcher(nlp.vocab)
passive_rule1 = [{'DEP': 'nsubjpass', 'OP': '*'}, {'DEP': 'xcomp', 'OP': '*'}, {'DEP': 'aux', 'OP': '*'}, {'DEP': 'auxpass'}, {'DEP': 'nsubj', 'OP': '*'}, {'TAG': 'VBN'}]
passive_rule2 = [{'DEP': 'attr'}, {'DEP': 'det', 'OP': '*'}, {'TAG': 'NOUN', 'OP': '?'}, {'TAG': 'VBN'}]
matcher.add('passive_rule1', None, passive_rule1)
matcher.add('passive_rule2', None, passive_rule2)
Finding the matches:
df.loc[:, 'PassiveVoice'] = df.Sentence.apply(lambda x: 1 if len(matcher(nlp(x))) > 0 else 0)
Alternatively, if anyone has any other ideas, I'd be glad to hear them!
Putting your 1M texts into a pandas dataframe and then calling nlp on them in a loop 1 million times is a bad idea. Instead, put your documents into a list via df["Sentence"].tolist() and process them efficiently with nlp.pipe():
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_md", disable=["ner"])  # NER isn't needed for these patterns

matcher = Matcher(nlp.vocab)
passive_rule1 = [
    {"DEP": "nsubjpass", "OP": "*"},
    {"DEP": "xcomp", "OP": "*"},
    {"DEP": "aux", "OP": "*"},
    {"DEP": "auxpass"},
    {"DEP": "nsubj", "OP": "*"},
    {"TAG": "VBN"},
]
passive_rule2 = [
    {"DEP": "attr"},
    {"DEP": "det", "OP": "*"},
    # Note: "NOUN" is a coarse-grained part of speech, not a fine-grained tag;
    # if that's what you mean, use {"POS": "NOUN"} instead of {"TAG": "NOUN"}.
    {"TAG": "NOUN", "OP": "?"},
    {"TAG": "VBN"},
]
# spaCy v2 signature; in spaCy v3 this is matcher.add("passive_rule1", [passive_rule1])
matcher.add("passive_rule1", None, passive_rule1)
matcher.add("passive_rule2", None, passive_rule2)

texts = ["this is my first sentence. about something", "this is another"]
# texts = df["Sentence"].tolist()

docs = nlp.pipe(texts, n_process=2, batch_size=50)
for doc in docs:
    if matcher(doc):
        # do something
        pass
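If you then want the 0/1 PassiveVoice column from the question, one way to build it (a minimal sketch; it relies on the fact that nlp.pipe yields docs in the same order as the input texts) is:

# Sketch: nlp.pipe preserves input order, so the flags line up with the dataframe rows
docs = nlp.pipe(df["Sentence"].tolist(), n_process=2, batch_size=50)
df["PassiveVoice"] = [1 if matcher(doc) else 0 for doc in docs]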
Also note that with nlp.pipe() you can turn on multiprocessing with n_process=2 (pick a value that suits your machine) and process your texts in batches with batch_size=50 (again, your choice).
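One more possible tweak, assuming you're on spaCy v3: the en_core_web_md pipeline also ships a lemmatizer that these patterns never use, so disabling it as well may shave off a bit more time. Benchmark on your own data:

# Assumption: spaCy v3 component names; the patterns only need TAG (tagger) and DEP (parser)
nlp = spacy.load("en_core_web_md", disable=["ner", "lemmatizer"])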