How to create a function that tokenizes and stems the words
My code:
def tokenize_and_stem(text):
    tokens = [sent for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(text)]
    filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]
    stems = stemmer.stem(filtered_tokens)
    return stems

words_stemmed = tokenize_and_stem("Today (May 19, 2016) is his only daughter's wedding.")
print(words_stemmed)
I get this error:
AttributeError                            Traceback (most recent call last)
in <module>
     13     return stems
     14
---> 15 words_stemmed = tokenize_and_stem("Today (May 19, 2016) is his only daughter's wedding.")
     16 print(words_stemmed)

in tokenize_and_stem(text)
      9
     10     # stem filtered_tokens
---> 11     stems = stemmer.stem(filtered_tokens)
     12
     13     return stems

/usr/local/lib/python3.6/dist-packages/nltk/stem/snowball.py in stem(self, word)
   1415
   1416
-> 1417         word = word.lower()
   1418
   1419         if word in self.stopwords or len(word) <= 2:

AttributeError: 'list' object has no attribute 'lower'
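The traceback boils down to this: `stem()` expects a single word and eventually calls `word.lower()`, so passing it a whole list fails. A minimal sketch, with plain Python standing in for the stemmer, reproduces the same failure mode:

```python
# stemmer.stem() calls word.lower() internally; a list has no .lower(),
# which is exactly the AttributeError in the traceback above.
filtered_tokens = ["today", "wedding"]
try:
    filtered_tokens.lower()  # same failure mode as stemmer.stem(filtered_tokens)
except AttributeError as e:
    print(e)  # 'list' object has no attribute 'lower'
```

The fix is to call `stem()` once per token, as the answers below do.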
import nltk
import string
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
def tokenize_and_stem(text):
    tokens = nltk.tokenize.word_tokenize(text)
    # strip out punctuation and make lowercase
    tokens = [token.lower().strip(string.punctuation)
              for token in tokens if token.isalnum()]
    # now stem the tokens
    tokens = [stemmer.stem(token) for token in tokens]
    return tokens
tokenize_and_stem("Today (May 19, 2016) is his only daughter's wedding.")
Output:
['today', 'may', '19', '2016', 'is', 'hi', 'onli', 'daughter', 'wed']
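Note why the possessive disappears from the output above: `word_tokenize` splits the clitic off as a separate token `"'s"`, and since `"'s".isalnum()` is `False`, the filter drops it. This can be checked with plain Python, no nltk needed:

```python
# word_tokenize produces "daughter" and "'s" as separate tokens;
# the isalnum() filter keeps the first and discards the second.
print("'s".isalnum())        # False -> "'s" is filtered out
print("daughter".isalnum())  # True  -> "daughter" is kept
```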
Your code:
def tokenize_and_stem(text):
    tokens = [sent for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(text)]
    filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]
    stems = stemmer.stem(filtered_tokens)
    return stems

words_stemmed = tokenize_and_stem("Today (May 19, 2016) is his only daughter's wedding.")
print(words_stemmed)
The error says: "word = word.lower() ... if word in self.stopwords or len(word) <= 2: 'list' object has no attribute 'lower'"
The error is not only about .lower() but also about the length check. If you run it without changing filtered_tokens on line 5 (that is, keeping your version of that line), you won't get any error, but the output will look like this:
["today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding."]
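That repeated-sentence output comes from the nested comprehension itself: it emits `sent` (the whole sentence) while looping over `word`, so each sentence appears once per word it contains. A small sketch with `str.split()` standing in for the nltk tokenizers shows the difference:

```python
text = "a b c"
sentences = [text]  # stand-in for nltk.sent_tokenize(text)

# Buggy: the output expression is `sent`, so the whole sentence
# is repeated once for every word in it.
buggy = [sent for sent in sentences for word in sent.split()]
print(buggy)  # ['a b c', 'a b c', 'a b c']

# Fixed: the output expression is `word`, giving the actual tokens.
fixed = [word for sent in sentences for word in sent.split()]
print(fixed)  # ['a', 'b', 'c']
```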
Here is your fixed code:
def tokenize_and_stem(text):
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]
    stems = [stemmer.stem(t) for t in filtered_tokens if len(t) > 0]
    return stems

words_stemmed = tokenize_and_stem("Today (May 19, 2016) is his only daughter's wedding.")
print(words_stemmed)
So, I only changed lines 3 and 7.