spaCy rule matcher on unit of measure before or after digit

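I am new to spaCy and I am trying to match measurements in some text. My problem is that the unit of measure sometimes comes before the value and sometimes after it. In some other cases it has a different name. Here is some code: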

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

# case 1:
text = "the surface is 31 sq"
# case 2:
# text = "the surface is sq 31"
# case 3:
# text = "the surface is square meters 31"
# case 4:
# text = "the surface is 31 square meters"
# case 5:
# text = "the surface is about 31 square meters"
# case 6:
# text = "the surface is 31 kilograms"

pattern = [
    {"IS_STOP": True}, 
    {"LOWER": "surface"}, 
    {"LEMMA": "be", "OP": "?"}, 
    {"LOWER": "sq", "OP": "?"},
    {"LOWER": "square", "OP": "?"},
    {"LOWER": "meters", "OP": "?"},
    {"IS_DIGIT": True}, 
    {"LOWER": "square", "OP": "?"},
    {"LOWER": "meters", "OP": "?"},
    {"LOWER": "sq", "OP": "?"} 
]

doc = nlp(text)

matcher = Matcher(nlp.vocab) 

# spaCy v3 takes a list of patterns; in v2 this was matcher.add("Surface", None, pattern)
matcher.add("Surface", [pattern])

matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

I have two problems: 1 - The pattern should match all of cases 1 to 5, but for case 1 the output is

4898162435462687487 Surface 0 4 the surface is 31
4898162435462687487 Surface 0 5 the surface is 31 sq 

which looks like a duplicate match to me.

2 - Case 6 should not match, but it does match with my pattern. Any suggestions on how to improve this?

EDIT: Is it possible to build an OR condition inside a pattern? Something like

pattern = [
    {"POS": "DET", "OP": "?"}, 
    {"LOWER": "surface"}, 
    {"LEMMA": "be", "OP": "?"},  
    [
      [{"LOWER": "sq", "OP": "?"},
      {"LOWER": "square", "OP": "?"},
      {"LOWER": "meters", "OP": "?"},
      {"IS_ALPHA": True, "OP": "?"},
      {"LIKE_NUM": True}]
     OR
      [{"LIKE_NUM": True},
      {"LOWER": "square", "OP": "?"},
      {"LOWER": "meters", "OP": "?"},
      {"LOWER": "sq", "OP": "?"} ]
    ]
]

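You cannot use OR like this, but you can define several patterns for the same label. So you need two patterns: one that matches a number preceded by sq, square, meters, or a combination of these words, and another that matches a number followed by at least one of these words.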

Code snippet:

import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")

texts = ["the surface is 31 sq", "the surface is sq 31", "the surface is square meters 31",
     "the surface is 31 square meters", "the surface is about 31 square meters", "the surface is 31 kilograms"]
pattern1 = [
      {"IS_STOP": True}, 
      {"LOWER": "surface"}, 
      {"LEMMA": "be", "OP": "?"}, 
      {"TEXT" : {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"},
      {"LIKE_NUM": True}
    ]
pattern2 = [
      {"IS_STOP": True}, 
      {"LOWER": "surface"}, 
      {"LEMMA": "be", "OP": "?"}, 
      {"IS_ALPHA": True, "OP": "?"},
      {"LIKE_NUM": True},
      {"TEXT" : {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"}
    ]

matcher = Matcher(nlp.vocab, validate=True)
# spaCy v3 takes a list of patterns; in v2 this was matcher.add("Surface", None, pattern1, pattern2)
matcher.add("Surface", [pattern1, pattern2])

for text in texts:
  doc = nlp(text)
  matches = matcher(doc)
  for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

Output:

4898162435462687487 Surface 0 5 the surface is 31 sq
4898162435462687487 Surface 0 5 the surface is sq 31
4898162435462687487 Surface 0 6 the surface is square meters 31
4898162435462687487 Surface 0 5 the surface is 31 square
4898162435462687487 Surface 0 6 the surface is about 31 square
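
The duplicate reported for case 1 in the question arises because the matcher returns every span that satisfies a pattern, and optional tokens allow both a shorter and a longer span to match. Below is a minimal sketch of one way to keep only the longest of the overlapping matches. It works on the plain (match_id, start, end) tuples, and `keep_longest` is a hypothetical helper name, not part of spaCy. spaCy itself offers `spacy.util.filter_spans` for Span objects and, in v3, the `greedy="LONGEST"` argument of `Matcher.add`, which are probably the better choice in real code.

```python
def keep_longest(matches):
    """Keep the longest match among overlapping (match_id, start, end) tuples.

    Hypothetical helper for illustration; not part of spaCy.
    """
    # Longest spans first; ties are broken by the earliest start.
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    kept = []
    used = set()  # token indices already covered by a kept match
    for match_id, start, end in ordered:
        if not any(i in used for i in range(start, end)):
            kept.append((match_id, start, end))
            used.update(range(start, end))
    # Return the surviving matches in document order.
    return sorted(kept, key=lambda m: m[1])


# The two overlapping matches reported for case 1 in the question.
matches = [(4898162435462687487, 0, 4), (4898162435462687487, 0, 5)]
filtered = keep_longest(matches)
print(filtered)  # [(4898162435462687487, 0, 5)]
```

In the loop from the snippet above, this would be called as `keep_longest(matcher(doc))` before printing.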

The {"TEXT" : {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"} part matches one or more tokens whose text matches the regular expression (one or more because of "OP": "+"):

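  • ^ - start of string (here, the token)
  • (?i: - start of a case-insensitive modifier group:
    • sq(?:uare)? - sq or square
    • | - or
    • m(?:et(?:er|re)s?)? - m, meter/metre, or meters/metres
  • ) - end of the group
  • $ - end of string (here, the token).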
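
To check the breakdown above without loading a spaCy model, the same expression can be tried with Python's `re` module on a few token strings. This is only a sketch: the sample token list is made up for illustration, and it tests the regular expression alone, not the full matcher pattern.

```python
import re

# Same regular expression as in the patterns above; each string stands in
# for the text of one token.
UNIT_RE = re.compile(r"^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$")

# Made-up sample tokens (an assumption for illustration).
tokens = ["sq", "Square", "m", "meter", "metres", "kilograms", "squares", "31"]

unit_tokens = [tok for tok in tokens if UNIT_RE.search(tok)]
print(unit_tokens)  # ['sq', 'Square', 'm', 'meter', 'metres']
```

Because of the ^ and $ anchors, "squares" and "kilograms" are rejected, which is why case 6 no longer matches.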