python spacy look for chunks backwards (before a reference)

Question

I am using spaCy for an NLP project. When you create a doc with spaCy, you can find the noun chunks (also known as "noun phrases") in the text like this:

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(u"The companies building cars do not want to spend more money in improving diesel engines because the government will not subsidise such engines anymore.")
for chunk in doc.noun_chunks:
    print(chunk.text)

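This prints the list of noun phrases.

In this case, for example, the first noun phrase is "The companies".

Suppose you have a text in which noun chunks are referenced by numbers.

For example:

doc = nlp("the Window (23) is closed because the wall (34) of the beautiful building (45) is not covered by the insurance (45)")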

Assume I already have code that identifies the references, for example by tagging them:

myprocessedtext = "the Window <ref>(23)</ref> is closed because the wall <ref>(34)</ref> of the beautiful building <ref>(45)</ref> is not covered by the insurance <ref>(45)</ref>"

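How can I get the noun chunk (noun phrase) that immediately precedes each reference?

My idea: pass the 10 words before each reference to a spaCy doc object, extract the noun chunks, and take the last one. This is very inefficient, because creating doc objects is time-consuming.

Are there any other ideas that do not require creating additional nlp objects?

Thanks.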

Answer 1

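You can analyze the whole document once and then find the noun chunk before each reference, either by token position or by character offset. The token offset of the last token in a noun chunk is noun_chunk[-1].i, and the character offset where that last token starts is noun_chunk[-1].idx. (Check that the analysis is not affected by the reference strings; the (1)-style references in your example seem to be analyzed as appositives, which is fine.)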
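A minimal sketch of the character-offset variant, not a definitive implementation. It assumes the references look like "(23)" (the regex `REF_PATTERN` and the helper name `chunks_before_references` are made up for illustration) and that `doc.noun_chunks` yields chunks in document order. It takes an already analyzed Doc, so no extra doc objects are created per reference.

```python
import re

# Assumption: a reference is a number in parentheses, e.g. "(23)".
REF_PATTERN = re.compile(r"\(\d+\)")


def chunks_before_references(doc):
    """Pair each reference in an analyzed spaCy Doc with the closest preceding noun chunk.

    Returns a list of (reference_text, chunk_or_None) tuples.
    """
    chunks = list(doc.noun_chunks)
    results = []
    for match in REF_PATTERN.finditer(doc.text):
        preceding = None
        for chunk in chunks:
            # chunk[-1].idx is the character offset where the chunk's last token starts.
            if chunk[-1].idx < match.start():
                preceding = chunk  # keep the latest chunk that starts its last token before the reference
            else:
                break  # assumes chunks arrive in document order
        results.append((match.group(), preceding))
    return results


# Usage, assuming spaCy and the en_core_web_sm model are installed:
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   doc = nlp("the Window (23) is closed because the wall (34) is damaged")
#   for ref, chunk in chunks_before_references(doc):
#       print(ref, "->", chunk.text if chunk is not None else None)
```

Whether the parser really keeps the references out of the noun chunks depends on the model, so it is worth printing the chunks for a few sample sentences first.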

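If the analysis is affected by the reference strings, remove them from the document while keeping track of their character offsets, analyze the whole document, and then find the noun chunk before each saved position.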
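The removal step could be sketched as follows. This is only an illustration under assumptions: references are taken to be plain "(23)"-style strings rather than the `<ref>` markup from the question, and `strip_references` and `chunk_before_offset` are hypothetical helper names. The saved offsets refer to positions in the cleaned text, so they can be compared directly with `end_char` of the noun chunks found in the cleaned Doc.

```python
import re

# Assumption: a reference is "(digits)", optionally preceded by whitespace
# that is removed together with it.
REF_PATTERN = re.compile(r"\s*\(\d+\)")


def strip_references(text):
    """Remove references; return (clean_text, [(reference, offset_in_clean_text), ...])."""
    pieces = []
    refs = []
    clean_len = 0
    last_end = 0
    for match in REF_PATTERN.finditer(text):
        kept = text[last_end:match.start()]
        pieces.append(kept)
        clean_len += len(kept)
        # The reference stood at this position of the cleaned text.
        refs.append((match.group().strip(), clean_len))
        last_end = match.end()
    pieces.append(text[last_end:])
    return "".join(pieces), refs


def chunk_before_offset(doc, offset):
    """Return the last noun chunk that ends at or before the given character offset."""
    preceding = None
    for chunk in doc.noun_chunks:
        if chunk.end_char <= offset:
            preceding = chunk
        else:
            break  # assumes chunks arrive in document order
    return preceding


# Usage, assuming spaCy and the en_core_web_sm model are installed:
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   clean, refs = strip_references("the Window (23) is closed because the wall (34) is damaged")
#   doc = nlp(clean)  # the whole text is analyzed only once
#   for ref, offset in refs:
#       chunk = chunk_before_offset(doc, offset)
#       print(ref, "->", chunk.text if chunk is not None else None)
```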
