Chunking for non-noun phrases in SpaCy

Question

Apologies if this seems like a silly question, but I'm still new to Python and SpaCy.

I have a data frame containing customer complaints. It looks a bit like this:

import pandas as pd

# Three sample complaints, one per row
df = pd.DataFrame([[1, 'I was waiting at the bus stop and then suddenly the car mounted the pavement'],
                   [2, 'When we got on the bus, we went upstairs but the bus braked hard and I fell'],
                   [3, 'The bus was clearly in the wrong lane when it crashed into my car']],
                  columns=['ID', 'Text'])

If I want to get the noun phrases, I can do this:

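import spacy

# Assumption: any English pipeline with a parser works; en_core_web_sm is used here
nlp = spacy.load('en_core_web_sm')

def extract_noun_phrases(text):
    # Each noun chunk is a Span; its label_ is 'NP'
    return [(chunk.text, chunk.label_) for chunk in nlp(text).noun_chunks]

def add_noun_phrases(df):
    # Adds the column in place
    df['noun_phrases'] = df['Text'].apply(extract_noun_phrases)

add_noun_phrases(df)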

What if I want to extract the prepositional phrases from the df? Specifically, I'm trying to extract phrases like these:

at the bus stop
in the wrong lane

I know I'm meant to use subtree for this, but I don't understand how to apply it to my dataset.

Answer 1

A prepositional phrase is simply a preposition followed by a noun phrase.

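Since you already know how to identify noun phrases with noun_chunks, it may be as simple as checking the token immediately before each noun phrase. If preceding_token.pos_ is 'ADP' (ADP stands for adposition, and prepositions are one type of adposition), then you have probably found a prepositional phrase.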
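The check described above could be sketched roughly as follows. The names extract_prep_phrases and add_prep_phrases are made up for illustration, and the sketch assumes doc is a parsed SpaCy Doc with the tagger and parser enabled, so that noun_chunks and pos_ are populated.

```python
def extract_prep_phrases(doc):
    """Collect spans made of an adposition plus the noun chunk right after it.

    Assumption: `doc` is a parsed spaCy Doc (tagger and parser enabled).
    """
    phrases = []
    for chunk in doc.noun_chunks:
        if chunk.start == 0:
            continue  # nothing precedes a chunk at the very start of the text
        preceding_token = doc[chunk.start - 1]
        if preceding_token.pos_ == "ADP":
            # Span from the adposition through the end of the noun chunk
            phrases.append(doc[chunk.start - 1:chunk.end].text)
    return phrases


def add_prep_phrases(df, nlp):
    """Add a 'prep_phrases' column in place, mirroring add_noun_phrases."""
    df["prep_phrases"] = df["Text"].apply(
        lambda text: extract_prep_phrases(nlp(text))
    )
```

With the nlp object and df from the question, calling add_prep_phrases(df, nlp) should fill the new column. This only catches a preposition that sits directly before a noun chunk, so the exact results depend on how the model tags and chunks each sentence.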

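Instead of checking pos_, you could check whether preceding_token.dep_ is 'prep'. Which of the two is available depends on which components of the SpaCy pipeline you have enabled, but the results should be similar.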
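A different route, closer to the subtree idea raised in the question, is to start from every token whose dep_ is 'prep' and take the span its subtree covers. This is a sketch rather than what the answer above literally describes. The name extract_prep_subtrees is made up, and it assumes an English pipeline whose parser uses the 'prep' dependency label.

```python
def extract_prep_subtrees(doc):
    """Return the text covered by the subtree of every token labelled 'prep'.

    Assumption: `doc` is a parsed spaCy Doc from an English pipeline.
    """
    phrases = []
    for token in doc:
        if token.dep_ == "prep":
            # left_edge and right_edge are the first and last tokens of token.subtree
            span = doc[token.left_edge.i:token.right_edge.i + 1]
            phrases.append(span.text)
    return phrases
```

It can be applied to the data frame in the same way as the noun phrase function, for example df['Text'].apply(lambda text: extract_prep_subtrees(nlp(text))). Because a subtree includes everything attached below the preposition, nested prepositional phrases will appear both inside the outer phrase and on their own.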

nlp

python-3.x

spacy