Using the right encoding for a csv file in Python 3

Question

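I wrote a function that takes one argument, file, which is a large .csv document. As soon as I call the function on one particular file (which is in German), I get the following error: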

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 4: invalid continuation byte

The system default encoding is utf-8, but if I call open('C:/Users/me/Desktop/data/myfile.csv'), the output is:

<_io.TextIOWrapper name='C:/Users/me/Desktop/data/myfile.csv' mode='r' encoding='cp1252'>.

Using file.decode('cp1252').encode('utf8') does not work, because 'str' object has no attribute 'decode', so I tried:

for decodedLine in open('C:/Users/me/Desktop/data/myfile.csv', 'r', encoding='cp1252'):
    line = decodedLine.split('\t')

But line is a list object, so I can't call .encode() on it.

How can I make .csv files with a different encoding readable?

Answer 1

I suggest opening it in pandas with the read_csv function, trying different encodings until the data displays correctly. For example:

import pandas as pd
df = pd.read_csv(r'C:yourpath',encoding = "latin-1")

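If that doesn't work, try similar encodings until you find the one that fits.

After that you can use the correct encoding wherever you need it.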
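The trial-and-error step can also be scripted with the standard library alone. This is only a rough sketch: the helper name first_working_encoding, the candidate list, and the sample file are made up for illustration. Note that a clean decode does not prove the encoding is the right one. latin-1 accepts every possible byte, so it is listed last as a fallback.

```python
import os
import tempfile


def first_working_encoding(path, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate that decodes the whole file without error.

    Hypothetical helper: the name and the default candidates are assumptions.
    """
    for encoding in candidates:
        try:
            with open(path, "r", encoding=encoding) as handle:
                for _ in handle:  # read line by line so large files stay cheap
                    pass
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the file, try the next one
        return encoding
    raise ValueError("none of the candidate encodings could decode the file")


# Demo on a throwaway file that contains German text saved as cp1252.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    with open(sample_path, "wb") as out:
        out.write("Name\tStadt\nJürgen\tKöln\n".encode("cp1252"))
    detected = first_working_encoding(sample_path)

print(detected)
```

The returned name can then be passed on, for instance as the encoding argument of pd.read_csv. It is still worth looking at a few rows to confirm that the umlauts come out right.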

Answer 2

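If I understood correctly, you have a csv file encoded as cp1252. If that is the case, all you have to do is open the file with the correct encoding. For the csv part, I would use the csv module from the standard library. Alternatively, you might want to look at a more specialized library such as pandas.

In any case, to parse your csv you can do this: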
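For context, here is a small sketch of why decoding as utf-8 fails on such a file. The word "Verhältnis" is just an example I picked so that the ä lands at position 4, as in the error message from the question. In cp1252 the letter ä is the single byte 0xe4, which utf-8 treats as the start of a multi-byte sequence.

```python
# "Verhältnis" is an arbitrary sample word, not taken from the asker's file.
raw = "Verhältnis".encode("cp1252")  # ä is stored as the single byte 0xe4

try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    # utf-8 expects continuation bytes after 0xe4, but finds a plain "l".
    error = exc

# Decoding with the encoding the bytes were written in works fine.
text = raw.decode("cp1252")

print(error)
print(text)
```

The same bytes are valid cp1252 but not valid utf-8, so the fix is to name the encoding when reading rather than to convert anything afterwards.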

import csv

with open(filepath, 'r', encoding='cp1252', newline='') as file_obj:
    # adjust the parameters according to your file, see docs for more
    csv_obj = csv.reader(file_obj, delimiter='\t', quotechar='"')
    for row in csv_obj:
        # row is a list of entries
        # this would print all entries, separated by commas
        print(', '.join(row))
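If the goal is a UTF-8 copy of the file, which is what the decode/encode attempt in the question seems to be after, there is no need to call .encode() on anything: open the source with its encoding and write to a second file opened with the target encoding. A minimal sketch, assuming cp1252 input; the function name reencode_file and the file names are hypothetical.

```python
import os
import tempfile


def reencode_file(src_path, dst_path, src_encoding="cp1252", dst_encoding="utf-8"):
    """Copy a text file, decoding with src_encoding and encoding with dst_encoding."""
    # newline="" on both sides leaves the original line endings untouched.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
            open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        for line in src:  # stream line by line, so large files are fine
            dst.write(line)


# Demo on a throwaway cp1252 file with tab-separated German text.
with tempfile.TemporaryDirectory() as tmp_dir:
    src_path = os.path.join(tmp_dir, "cp1252.csv")
    dst_path = os.path.join(tmp_dir, "utf8.csv")
    with open(src_path, "wb") as out:
        out.write("Größe\tKäse\r\n".encode("cp1252"))
    reencode_file(src_path, dst_path)
    with open(dst_path, "rb") as result:
        converted = result.read()

print(converted.decode("utf-8"))
```

Python does the decoding on read and the encoding on write, so the strings in between are ordinary str objects.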

Tags: python, csv, encoding, decoding, python-3.x