Extracting a sentence from a dataframe description column based on a phrase
I have a dataframe with a 'description' column containing product details. Each description in the column is a long paragraph, like:
"This is a superb product. I so so loved this superb product that I wanna gift to all. This is like the quality and packaging. I like it very much"
How do I locate/extract the sentence containing the phrase "superb product" and put it in a new column?
So for this case, the result would be the
expected output
I have tried this:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
searched_words = ['superb product', 'SUPERB PRODUCT']
print(df['description'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                            if any(True for w in word_tokenize(sent)
                                                   if stemmer.stem(w.lower()) in searched_words)]))
This output is not what I want, although it works fine if I put only a single word in the searched_words list.
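The attempt above fails on phrases because it stems each word individually and checks it against the list, so a two-word entry like 'superb product' can never match a single stemmed token. One fix is to test the phrase as a substring of each sentence instead. A minimal sketch, using a plain regex split in place of NLTK's sent_tokenize and a hypothetical extract_sentences helper:

```python
import re

def extract_sentences(text, phrase):
    """Return the sentences of `text` that contain `phrase`, case-insensitively."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return [s.strip() for s in sentences if phrase.lower() in s.lower()]

text = ("This is a superb product. I so so loved this superb product "
        "that I wanna gift to all. This is like the quality and packaging. "
        "I like it very much.")
print(extract_sentences(text, 'superb product'))

# With a dataframe:
# df['extracted_sentence'] = df['description'].apply(
#     lambda t: extract_sentences(t, 'superb product'))
```

Substring matching loses the stemming behaviour of the original attempt, but for a fixed literal phrase that is usually what is wanted.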
Assuming the paragraphs are neatly formatted as sentences with trailing periods, e.g.:
for index, paragraph in df['column_name'].items():  # iteritems() was removed in pandas 2.0
    for sentence in paragraph.split('.'):
        if 'superb prod' in sentence:
            print(sentence)
            df.at[index, 'extracted_sentence'] = sentence  # avoids chained-assignment warnings
This will be slow, but if there's a better way, I'd like to know it.
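If the descriptions really are period-terminated sentences, the Python-level loop above can be replaced with a single vectorized pandas call. A sketch using Series.str.extract, with the caveat that it captures only the first matching sentence per row (unlike the loop, which keeps the last):

```python
import re
import pandas as pd

df = pd.DataFrame({'description': [
    "This is a superb product. I so so loved this superb product "
    "that I wanna gift to all. This is like the quality and packaging. "
    "I like it very much."
]})

# Capture a run of non-period characters around the phrase, plus the closing period.
df['extracted_sentence'] = df['description'].str.extract(
    r'([^.]*superb product[^.]*\.)', flags=re.IGNORECASE, expand=False)
print(df['extracted_sentence'][0])
```

Rows without a match get NaN, and a leading space may survive for mid-paragraph matches; chain `.str.strip()` if that matters.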
There are many ways to do this, and @ChootsMagoots has already given you a good answer, but spaCy is also very effective: you can simply define a pattern that will lead you to the target sentence. Before that, though, you may need to define a function that splits the text into sentences. Here is the code:
import spacy
from spacy.matcher import Matcher

def product_sentencizer(doc):
    """Mark sentence starts by scanning for periods only."""
    for i, token in enumerate(doc[:-2]):  # the last token cannot start a sentence
        if token.text == ".":
            doc[i + 1].is_sent_start = True
        else:
            doc[i + 1].is_sent_start = False  # tell the parser to leave this boundary alone
    return doc

nlp = spacy.load('en_core_web_sm', disable=['ner'])
nlp.add_pipe(product_sentencizer, before="parser")  # run before the parser sets its own boundaries

text = ("This is a superb product. I so so loved this superb product "
        "that I wanna gift to all. This is like the quality and packaging. "
        "I like it very much.")
doc = nlp(text)

matcher = Matcher(nlp.vocab)
# "superb product" is two tokens, so a single ORTH pattern would never match;
# use one dict per token, with LOWER for case-insensitive matching.
pattern = [{'LOWER': 'superb'}, {'LOWER': 'product'}]
matcher.add("superb_product", None, pattern)  # spaCy v2; in v3: matcher.add("superb_product", [pattern])

matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)
    print(matched_span.sent)