Extracting a sentence from a dataframe description column based on a phrase
I have a dataframe with a 'description' column containing product details. Each description in the column is a long paragraph, like:
"This is a superb product. I so so loved this superb product that I wanna gift to all. This is like the quality and packaging. I like it very much"
How do I locate/extract the sentence containing the phrase "superb product" and put it in a new column?
So for this case, the result would be the
expected output
I have tried this:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
searched_words = ['superb product', 'SUPERB PRODUCT']
print(df['description'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                            if any(True for w in word_tokenize(sent)
                                                   if stemmer.stem(w.lower()) in searched_words)]))
This output is not what I want, although it works fine if I put only a single word in the searched_words list.
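The attempt above fails on phrases because it stems each word individually and checks it against the list, so a two-word entry like 'superb product' can never match a single stemmed token. One fix is to test the phrase as a substring of each sentence instead. A minimal sketch, using a plain regex split in place of NLTK's sent_tokenize and a hypothetical extract_sentences helper:

```python
import re

def extract_sentences(text, phrase):
    """Return the sentences of `text` that contain `phrase`, case-insensitively."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return [s.strip() for s in sentences if phrase.lower() in s.lower()]

text = ("This is a superb product. I so so loved this superb product "
        "that I wanna gift to all. This is like the quality and packaging. "
        "I like it very much.")
print(extract_sentences(text, 'superb product'))

# With a dataframe:
# df['extracted_sentence'] = df['description'].apply(
#     lambda t: extract_sentences(t, 'superb product'))
```

Substring matching loses the stemming behaviour of the original attempt, but for a fixed literal phrase that is usually what is wanted.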
Assuming the paragraphs are neatly formatted as sentences with trailing periods, e.g.:
for index, paragraph in df['column_name'].items():  # iteritems() was removed in pandas 2.0
    for sentence in paragraph.split('.'):
        if 'superb prod' in sentence:
            print(sentence)
            df.at[index, 'extracted_sentence'] = sentence  # avoids chained-assignment warnings
This will be slow, but if there's a better way, I'd like to know it.
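If the descriptions really are period-terminated sentences, the Python-level loop above can be replaced with a single vectorized pandas call. A sketch using Series.str.extract, with the caveat that it captures only the first matching sentence per row (unlike the loop, which keeps the last):

```python
import re
import pandas as pd

df = pd.DataFrame({'description': [
    "This is a superb product. I so so loved this superb product "
    "that I wanna gift to all. This is like the quality and packaging. "
    "I like it very much."
]})

# Capture a run of non-period characters around the phrase, plus the closing period.
df['extracted_sentence'] = df['description'].str.extract(
    r'([^.]*superb product[^.]*\.)', flags=re.IGNORECASE, expand=False)
print(df['extracted_sentence'][0])
```

Rows without a match get NaN, and a leading space may survive for mid-paragraph matches; chain `.str.strip()` if that matters.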
There are many ways to do this, and @ChootsMagoots has already given you a good answer, but spaCy is also very effective: you can simply define a pattern that will lead you to the target sentence. Before that, though, you may need to define a function that splits the text into sentences. Here is the code:
import spacy
from spacy.matcher import Matcher

def product_sentencizer(doc):
    """Mark sentence starts by scanning for periods only."""
    for i, token in enumerate(doc[:-2]):  # the last token cannot start a sentence
        if token.text == ".":
            doc[i + 1].is_sent_start = True
        else:
            doc[i + 1].is_sent_start = False  # tell the parser to leave this boundary alone
    return doc

nlp = spacy.load('en_core_web_sm', disable=['ner'])
nlp.add_pipe(product_sentencizer, before="parser")  # run before the parser sets its own boundaries

text = ("This is a superb product. I so so loved this superb product "
        "that I wanna gift to all. This is like the quality and packaging. "
        "I like it very much.")
doc = nlp(text)

matcher = Matcher(nlp.vocab)
# "superb product" is two tokens, so a single ORTH pattern would never match;
# use one dict per token, with LOWER for case-insensitive matching.
pattern = [{'LOWER': 'superb'}, {'LOWER': 'product'}]
matcher.add("superb_product", None, pattern)  # spaCy v2; in v3: matcher.add("superb_product", [pattern])

matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)
    print(matched_span.sent)