TypeError: a bytes-like object is required, not 'str': even with the encode
I just want my script to print its output. I have this problem, I have researched and read a lot of answers, and even after adding .encode('utf-8') it still doesn't work.
import pandas
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

n_components = 30
n_top_words = 10

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])
    return message

text = pandas.read_csv('fr_pretraitement.csv', encoding = 'utf-8')
text_clean = text['liste2']
text_raw = text['liste1']
text_clean_non_empty = text_clean.dropna()
not_commas = text_raw.str.replace(',', '')
text_raw_list = not_commas.values.tolist()
text_clean_list = text_clean_non_empty.values.tolist()

tf_vectorizer = CountVectorizer()
tf = tf_vectorizer.fit_transform(text_clean_list)
tf_feature_names = tf_vectorizer.get_feature_names()

lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
lda.fit(tf)

print('topics...')
print(print_top_words(lda, tf_feature_names, n_top_words))

document_topics = lda.fit_transform(tf)
topics = print_top_words(lda, tf_feature_names, n_top_words)

for i in range(len(topics)):
    print("Topic {}:".format(i))
    docs = np.argsort(document_topics[:, i])[::-1]
    for j in docs[:300]:
        cleans = " ".join(text_clean_list[j].encode('utf-8').split(",")[:2])
        print(cleans.encode('utf-8') + ',' + " ".join(text_raw_list[j].encode('utf-8').split(",")[:2]))
My output:
Traceback (most recent call last):
File "script.py", line 62, in
cleans = " ".join(text_clean_list[j].encode('utf-8').split(",")[:2])
TypeError: a bytes-like object is required, not 'str'
cleans = " ".join(text_clean_list[j].encode('utf-8').split(",")[:2])
You are encoding the string text_clean_list[j] into bytes, but what about split(",")?
The "," is still a str, so you are trying to split a bytes-like object with a string.
Example:
a = "this,that"
>>> a.encode('utf-8').split(',')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: a bytes-like object is required, not 'str'
Edit
Workarounds:
1- One solution could be not to encode the string object right away: split first, then encode. As in my example:
a = "this, that"
c = a.split(",")
cleans = [x.encode('utf-8') for x in c]
2- Or simply encode the "," itself:
cleans = a.encode("utf-8").split(b",")
Both give the same result. It would have been better if you had included example input and output.
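For instance, run in a Python 3 interpreter on the same input, both variants produce the same list of bytes (a quick check, using only the str/bytes methods shown above):

>>> a = "this, that"
>>> [x.encode('utf-8') for x in a.split(",")]
[b'this', b' that']
>>> a.encode("utf-8").split(b",")
[b'this', b' that']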
Let's look at the line that raises the error:
cleans = " ".join(text_clean_list[j].encode('utf-8').split(",")[:2])
Let's go through it step by step:
text_clean_list[j]
is of type str => no error so far
text_clean_list[j].encode('utf-8')
is of type bytes => no error up to this point
text_clean_list[j].encode('utf-8').split(",")
Error: the argument "," passed to the split() method is of type str, but it must be of type bytes (since split() here is a method of a bytes object) => this raises the error a bytes-like object is required, not 'str'.
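You can verify the type at each step in the interpreter (a quick sketch with a stand-in string, since the contents of text_clean_list are not shown in the question):

>>> s = "foo,bar"  # stands in for text_clean_list[j]
>>> type(s)
<class 'str'>
>>> type(s.encode('utf-8'))
<class 'bytes'>
>>> s.encode('utf-8').split(",")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: a bytes-like object is required, not 'str'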
Note: replacing split(",") with split(b",") avoids the error (though it may not be the behavior you expect...).
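Since the goal here is only to print text, arguably the simplest fix is to drop the encode('utf-8') calls entirely: in Python 3, print() takes str directly, and the next line's cleans.encode('utf-8') + ',' would mix bytes and str and raise the same TypeError again. A minimal sketch of the inner loop rewritten this way (assuming text_clean_list and text_raw_list hold plain strings, as in the question):

for j in docs[:300]:
    # split while everything is still str, keep the first two comma-separated fields
    cleans = " ".join(text_clean_list[j].split(",")[:2])
    raws = " ".join(text_raw_list[j].split(",")[:2])
    print(cleans + ',' + raws)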