Decoding issue while reading scraped csv files from memory

Question

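I have a web scraper made with scrapy that goes through pages, downloads several CSV/TXT/ZIP files, and parses the data in them to produce scrapy items. The files are not saved to disk; they are kept in memory, because I don't need them after parsing.

To be precise, the files are either .txt files or .zip archives containing a .txt file, but they are comma-separated, so I handle them as csv. It works like this: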

import csv
import io
import zipfile

headers = ['list', 'of strings', 'with headers names']

def parse(self, response, ftype):
    if ftype == 'zip':
        zip_file = zipfile.ZipFile(io.BytesIO(response.body))
        file = io.TextIOWrapper(zip_file.open(zip_file.namelist()[0]))
    else: #If file was .txt
        file = io.StringIO(response.text)

    reader = csv.DictReader(file, fieldnames=headers)
    for row in reader:
        yield self.parse_row(row)

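All the files open successfully, but some of them raise UnicodeDecodeError while the reader is being iterated. (The rows before the error are read correctly, and all the problem files were originally .zip files.)

The exception says: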

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 7123: invalid start byte.

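[It also occurs with byte 0x8a]

I don't know what to do about it. Is there a way to read these files with a different encoding using csv.DictReader or io?

I'm looking for a solution that preferably doesn't involve third-party dependencies (meaning anything not included in the Python standard library), even if that makes it harder to do.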

Answer 1

The problem is that some of the files inside your zip archives are not encoded as UTF-8. Here is a simplified example of what is happening.

>>> import csv
>>> import io
>>> # Make a string of csv-like rows.
>>> rows = 'h1,h2\nhello,world\nßäæ,öë\n'
>>> # Encode the data with an encoding that isn't UTF-8
>>> # (cp1252 is common on Windows machines)  
>>> bs = rows.encode('cp1252')
>>> # Load the encoded bytes into a file-like object
>>> bio = io.BytesIO(bs)                  
>>> bio.seek(0)                          
0
>>> # Load the file-like object into a TextIOWrapper
>>> w = io.TextIOWrapper(bio)            
>>> w.seek(0)                            
0
>>> # Pass the TextIOWrapper to a csv reader and read it
>>> reader = csv.reader(w)               
>>> for row in reader:print(row)         
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/kev/virtual-envs/so38/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdf in position 18: invalid continuation byte

The solution is to pass an encoding argument to TextIOWrapper so that the data is decoded correctly:

>>> bio = io.BytesIO(bs)
>>> bio.seek(0)
0
>>> # Tell TextIOWrapper these bytes are cp1252!
>>> w = io.TextIOWrapper(bio, encoding='cp1252')
>>> w.seek(0)
0
>>> reader = csv.reader(w)
>>> for row in reader:print(row)
... 
['h1', 'h2']
['hello', 'world']
['ßäæ', 'öë']
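
Applied to the in-memory zip case from the question, a minimal sketch might look like the following. The helper name rows_from_zip_bytes and the choice of cp1252 as the default are assumptions made for illustration; they are not part of the original spider.

```python
import csv
import io
import zipfile


def rows_from_zip_bytes(body, fieldnames, encoding='cp1252'):
    """Yield dict rows from the first member of an in-memory zip archive.

    `encoding` is an assumption about the data; change it to match your files.
    """
    with zipfile.ZipFile(io.BytesIO(body)) as zip_file:
        raw = zip_file.open(zip_file.namelist()[0])
        # newline='' is what the csv module recommends for file objects.
        with io.TextIOWrapper(raw, encoding=encoding, newline='') as text:
            yield from csv.DictReader(text, fieldnames=fieldnames)


# Build a small cp1252-encoded zip in memory to exercise the helper.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('data.txt', 'hello,world\nßäæ,öë\n'.encode('cp1252'))

rows = list(rows_from_zip_bytes(buf.getvalue(), ['h1', 'h2']))
print(rows)
```

In a spider, body would be response.body, and the rows could be passed to parse_row as before.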

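There is another problem, though: you need to know which encoding to pass to TextIOWrapper. Unfortunately, there is no 100% reliable way to determine a file's encoding. You can make a guess (for example, if all these files come from Windows users in English-speaking^* countries, cp1252 is a likely candidate), or you can use a tool such as chardet to guess for you.

^* The documentation for the codecs module in the standard library has a list of the codecs available to Python and the human languages they are associated with.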
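
If a third-party detector is not an option, one standard-library-only approach is to try a short list of candidate encodings in order and keep the first one that decodes without error. This is only a sketch: the candidate list and the function name decode_with_fallback are assumptions, and a successful decode does not prove the guess was right (latin-1 in particular accepts any byte sequence, so it works as a last resort but may produce wrong characters).

```python
import csv
import io

# Candidate encodings, tried in order. This list is an assumption about
# where the files come from; adjust it to your data.
CANDIDATE_ENCODINGS = ('utf-8', 'cp1252', 'latin-1')


def decode_with_fallback(data, encodings=CANDIDATE_ENCODINGS):
    """Return (text, encoding) for the first encoding that decodes `data`."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings could decode the data')


# Bytes that are valid cp1252 but not valid UTF-8.
raw = 'h1,h2\nhello,world\nßäæ,öë\n'.encode('cp1252')
text, used = decode_with_fallback(raw)
rows = list(csv.reader(io.StringIO(text, newline='')))
print(used, rows)
```

Since the files are already held in memory, decoding the whole byte string at once costs little, and the resulting text can be wrapped in io.StringIO and handed to csv.DictReader in the same way.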
