spaCy rule-based Matcher: extract a value from a matched sentence
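I have a custom rule-based match in spaCy that matches some sentences in a document. I now want to extract a number from each matched sentence. However, the matched sentences do not always have the same shape and form. What is the best way to do this?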
import spacy
from spacy.matcher import Matcher

# Assumption: any English pipeline with a lemmatizer and parser; the model name is illustrative.
nlp = spacy.load("en_core_web_sm")

# case 1:
texts = ["the surface is 31 sq",
         "the surface is sq 31",
         "the surface is square meters 31",
         "the surface is 31 square meters",
         "the surface is about 31,2 square",
         "the surface is 31 kilograms"]

# Unit first, then the number.
pattern = [
    {"LOWER": "surface"},
    {"LEMMA": "be", "OP": "?"},
    {"TEXT" : {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"},
    {"IS_ALPHA": True, "OP": "?"},
    {"LIKE_NUM": True},
]

# Number first, then the unit.
pattern_1 = [
    {"LOWER": "surface"},
    {"LEMMA": "be", "OP": "?"},
    {"IS_ALPHA": True, "OP": "?"},
    {"LIKE_NUM": True},
    # Note: "OP" is nested inside the "TEXT" dict here, so it does not act as a token operator.
    {"TEXT" : {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$", "OP": "+"}}
]

matcher = Matcher(nlp.vocab)
# spaCy v3 signature; in v2 this was matcher.add("Surface", None, pattern, pattern_1).
matcher.add("Surface", [pattern, pattern_1])

for index, text in enumerate(texts):
    print(f"Case {index}")
    doc = nlp(text)
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # Get string representation
        span = doc[start:end]  # The matched span
        print(match_id, string_id, start, end, span.text)
My output is:
Case 0
4898162435462687487 Surface 1 5 surface is 31 sq
Case 1
4898162435462687487 Surface 1 5 surface is sq 31
Case 2
4898162435462687487 Surface 1 6 surface is square meters 31
Case 3
4898162435462687487 Surface 1 5 surface is 31 square
Case 4
4898162435462687487 Surface 1 6 surface is about 31,2 square
Case 5
I only want to return the number (the square meters), something like [31, 31, 31, 31, 31.2] rather than the full text. What is the correct way to do this in spaCy?
Since each match contains exactly one LIKE_NUM token, you can just walk the subtree of the matched span and return the first such token:
value = [token for token in span.subtree if token.like_num][0]
Test:
results = []
for text in texts:
    doc = nlp(text)
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]  # The matched span
        results.append([token for token in span.subtree if token.like_num][0])
print(results)  # => [31, 31, 31, 31, 31,2]
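Note that the list above holds spaCy tokens, so the last entry prints as 31,2 rather than the float 31.2 asked for in the question. A minimal sketch of the conversion follows, assuming the comma is a decimal separator and not a thousands separator. The helper names `to_float` and `first_number` are hypothetical, and `SimpleNamespace` objects stand in for spaCy tokens so the snippet runs without a model.

```python
from types import SimpleNamespace


def to_float(text: str) -> float:
    """Convert a numeric token text such as "31" or "31,2" to a float.

    Assumption: a comma is a decimal separator, as in "31,2".
    """
    return float(text.replace(",", "."))


def first_number(tokens):
    """Return the first like_num token that parses as a float, else None."""
    for token in tokens:
        if token.like_num:
            try:
                return to_float(token.text)
            except ValueError:
                # like_num is also true for words such as "ten",
                # which float() cannot parse; skip those.
                continue
    return None


# Stand-ins for the tokens of the matched span "surface is about 31,2 square".
# With spaCy, pass the span itself: first_number(doc[start:end]).
span = [
    SimpleNamespace(text="surface", like_num=False),
    SimpleNamespace(text="is", like_num=False),
    SimpleNamespace(text="about", like_num=False),
    SimpleNamespace(text="31,2", like_num=True),
    SimpleNamespace(text="square", like_num=False),
]

print(first_number(span))  # 31.2
```

Iterating over the matched span instead of `span.subtree` keeps the search inside the match, which may be preferable when a sentence contains other numbers outside the matched part.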