Shorten a sentence using a Python library like NLTK
I am using NLTK to remove stop words from a sentence.
For example: "I would love to fly again via American Airlines"
Expected result: "Love to fly American Airlines"
I have tried the following code:
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

# Tokenizing the text
txt = "I love to fly with American Airlines"
stopWords = set(stopwords.words("english"))
words = word_tokenize(txt)

# Creating a frequency table to keep the score of each word
freqTable = dict()
for word in words:
    word = word.lower()
    if word in stopWords:
        continue
    if word in freqTable:
        freqTable[word] += 1
    else:
        freqTable[word] = 1

# Creating a dictionary to keep the score of each sentence
sentences = sent_tokenize(txt)
sentenceValue = dict()
for sentence in sentences:
    for word, freq in freqTable.items():
        if word in sentence.lower():
            if sentence in sentenceValue:
                sentenceValue[sentence] += freq
            else:
                sentenceValue[sentence] = freq

sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]

# Average value of a sentence from the original text
average = int(sumValues / len(sentenceValue))

# Storing sentences into our summary.
summary = ''
for sentence in sentences:
    if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.2 * average)):
        summary += " " + sentence

print("Summary: " + summary)
The result is an empty string; I think the sentence is too short for NLTK to work.
I'm just looking into whether there is a simpler way, before I go and train a model for this.
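The empty summary actually follows from the threshold itself rather than from sentence length: with only one sentence, that sentence's score equals the average, so `score > 1.2 * average` can never hold. A minimal sketch of the same threshold logic, using a made-up score of 5 for illustration:

```python
# With a single sentence, its score IS the average, so the
# 1.2x-average cutoff always filters it out.
sentenceValue = {"I love to fly with American Airlines": 5}  # illustrative score

sumValues = sum(sentenceValue.values())
average = int(sumValues / len(sentenceValue))  # 5

summary = ""
for sentence, score in sentenceValue.items():
    if score > 1.2 * average:  # 5 > 6.0 is False
        summary += " " + sentence

print(repr(summary))  # ''
```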
A Python library that can easily and efficiently shorten a sentence by removing stop words is nltk, which you are already using. But there may be some problems with your approach (the logic or the code). The code below works fine:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "I love to fly with American Airlines"

stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)

# Equivalent one-liner: [w for w in word_tokens if w not in stop_words]
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)
print(" ".join(filtered_sentence))
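One caveat: NLTK's English stop-word list is all lowercase, so matching is case-sensitive and a capitalized "I" at the start of a sentence would not be removed. A small sketch of case-insensitive filtering, using a hardcoded subset of the stop-word list so the example is self-contained (the real code would use `stopwords.words('english')`):

```python
# Illustrative subset of NLTK's English stop words (all lowercase).
stop_words = {"i", "to", "with", "would", "via", "again"}

def remove_stopwords(sentence):
    # Compare each token in lowercase against the stop-word set,
    # but keep the token's original casing in the output.
    tokens = sentence.split()
    return " ".join(w for w in tokens if w.lower() not in stop_words)

print(remove_stopwords("I love to fly with American Airlines"))
# love fly American Airlines
```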