How to read csv files (with special characters) in Python? How can I decode the text data? Read encoded text from file and convert to string

Question

I used tweepy to collect tweet text and stored it in a csv file with Python's csv.writer(), but I had to encode the text as utf-8 before storing it, otherwise tweepy throws a weird error.

import pandas as pd

# Raw string (r'...') so the backslashes in the Windows path are not treated as escapes
data = pd.read_csv(r'C:\Users\Lenovo\Desktop\_Carabinieri_10_tweets.csv', delimiter=",", encoding="utf-8")

data.head()

print(data.head())

Now, the text data is stored like this:

Output

id … text

0 1228280254256623616 … b'RT @MinisteroDifesa: #14febbraio Il Ministro…

1 1228257366841405441 … b'\xe2\x80\x9cNon t\xe2\x80\x99ama chi amor ti…

2 1228235394954620928 … b'Eseguite dai #Carabinieri del Nucleo Investi…

3 1228219588589965316 … b'Il pianeta brucia\nConosci il black carbon?...

4 1228020579485261824 … b'RT @Coninews: Emozioni tricolore \xe2\x9c\xa…

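Although I passed "utf-8" when reading the file into a DataFrame with the code shown above, the characters in the output look very different. The output looks like bytes. The language is Italian.

I tried to decode this with this code (there is more data in other columns; the text is in the second column). However, it does not decode the text. I cannot use .decode('utf-8') because the csv reader reads the data as strings, i.e. type(row[2]) is 'str', and I can't seem to convert it into bytes; the data just gets encoded once more!

How can I decode the text data?

I would be very happy if you could help me. Thanks in advance.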

Answer 1

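The problem probably comes from the way you wrote the csv file. I would bet that, when read as text (with a simple text editor such as notepad, notepad++ or vi), it actually contains: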

1228280254256623616,…,b'RT @MinisteroDifesa: #14febbraio Il Ministro...'
1228257366841405441,…,b'\xe2\x80\x9cNon t\xe2\x80\x99ama chi amor ti...'
...

or:

1228280254256623616,…,"b'RT @MinisteroDifesa: #14febbraio Il Ministro...'"
1228257366841405441,…,"b'\xe2\x80\x9cNon t\xe2\x80\x99ama chi amor ti...'"
...

Pandas read_csv then correctly reads the text representation of a byte string.
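To illustrate how such a file can come about, here is a minimal sketch. It assumes Python 3 and that the text was encoded with .encode("utf-8") before being handed to csv.writer, as the question describes; the tweet text and the in-memory buffer are made up for the example and are not the asker's actual code.

```python
import csv
import io

# Hypothetical tweet text, encoded to bytes before writing (as in the question).
tweet = "“Non t’ama chi amor ti...".encode("utf-8")

buffer = io.StringIO()  # in-memory stand-in for the csv file
csv.writer(buffer).writerow([1228257366841405441, tweet])

# csv.writer calls str() on non-string values, so the bytes object is
# written as its textual representation: b'\xe2\x80\x9cNon ...'
line = buffer.getvalue()
print(line)
```

The file therefore holds the characters b, ', \, x, e, 2 and so on, which is why reading it back as UTF-8 cannot restore the original text.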

The correct fix would be to write true UTF-8 encoded strings, but since I do not know the code that wrote the file, I cannot propose a fix.
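As a rough sketch of what writing true UTF-8 could look like, assuming Python 3: pass plain str values to csv.writer and let open() handle the encoding. The rows, column names and file name below are hypothetical stand-ins, since the original writing code is not shown.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for (id, text) pairs obtained from tweepy.
rows = [
    (1228257366841405441, "“Non t’ama chi amor ti..."),
]

path = os.path.join(tempfile.mkdtemp(), "tweets.csv")  # illustrative file name

# Pass str values to csv.writer and let open() do the UTF-8 encoding.
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "text"])
    writer.writerows(rows)

# Reading the file back as UTF-8 yields the original text, with no b'...' wrapper.
with open(path, newline="", encoding="utf-8") as f:
    read_back = list(csv.reader(f))
print(read_back[1][1])
```

A file written this way can be loaded with pd.read_csv(..., encoding="utf-8") and needs no further decoding.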

A possible workaround is to use ast.literal_eval to convert the text representation into a byte string and then decode it:

import ast

df['text'] = df['text'].apply(lambda x: ast.literal_eval(x).decode('utf8'))

It should give:

                    id ... text
0  1228280254256623616 ... RT @MinisteroDifesa: #14febbraio Il Ministro...
1  1228257366841405441 ... “Non t’ama chi amor ti...
...
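If some cells might not hold a b'...' representation (missing values or already clean text, for instance), a slightly more defensive variant of the workaround could look like the sketch below. The helper name decode_bytes_repr is hypothetical, and the sample value is taken from the file contents shown above.

```python
import ast

def decode_bytes_repr(value):
    """Turn the text "b'...'" into the str it encodes; pass other values through."""
    if isinstance(value, str) and value[:2] in ("b'", 'b"'):
        try:
            parsed = ast.literal_eval(value)  # only evaluates Python literals
        except (ValueError, SyntaxError):
            return value  # not a well-formed literal, leave it untouched
        if isinstance(parsed, bytes):
            return parsed.decode("utf-8")
    return value

# Raw string: the backslashes are literal characters, as they are in the csv file.
sample = r"b'\xe2\x80\x9cNon t\xe2\x80\x99ama chi amor ti...'"
print(decode_bytes_repr(sample))
```

It can be applied to the column in the same way as the lambda: df['text'] = df['text'].apply(decode_bytes_repr).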


python

utf-8

tweepy