Python and JSON UTF-8 encoding

Question

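I'm currently running into some encoding problems. Since I'm French, I often use characters such as é or è. I'm trying to figure out why they don't show up in the JSON file that I create automatically with Scrapy.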

Here is my Python code:

# -*- coding: utf-8 -*-

import scrapy


class BlogSpider(scrapy.Spider):
    name = 'pokespider'
    start_urls = [
        "https://www.pokepedia.fr/Liste_des_Pok%C3%A9mon_par_apport_en_EV"]

    def parse(self, response):
        for poke in response.css('table.tableaustandard.sortable tr')[1:]:
            num = poke.css('td ::text').extract_first()
            nom = poke.css('td:nth-child(3) a ::text').extract_first()

            yield {'numero': int(num), 'nom': nom}

After I run the scrapy command, the code generates a JSON file. Here are its first lines:

[
{"numero": 1, "nom": "Bulbizarre"},
{"numero": 2, "nom": "Herbizarre"},
{"numero": 3, "nom": "Florizarre"},
{"numero": 4, "nom": "Salam\u00e8che"},
...
]

(Yes, these are the French Pokémon names.)

So I'd like to get rid of this \u00e8 sequence, which should be an è. Is there a way to do that?

Thanks in advance, and I hope my English isn't too bad :)

Answer 1

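Set the FEED_EXPORT_ENCODING option to 'utf-8'; here it is set through the spider's custom_settings. By default, Scrapy's JSON feed export writes non-ASCII characters as \uXXXX escape sequences; with this setting, è is written to the file literally.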

import scrapy
from scrapy.crawler import CrawlerProcess


class BlogSpider(scrapy.Spider):
    name = 'pokespider'
    custom_settings = {'FEED_EXPORT_ENCODING': 'utf-8'}
    start_urls = [
        "https://www.pokepedia.fr/Liste_des_Pok%C3%A9mon_par_apport_en_EV"]

    def parse(self, response):
        for poke in response.css('table.tableaustandard.sortable tr')[1:]:
            num = poke.css('td ::text').extract_first()
            nom = poke.css('td:nth-child(3) a ::text').extract_first()

            yield {'numero': int(num), 'nom': nom}

process = CrawlerProcess(settings={
    "FEEDS": {
        "items_json": {"format": "json"},
    },
})

process.crawl(BlogSpider)
process.start()
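
For context, the \u00e8 in the original output is not corruption: it is a valid JSON escape that any JSON parser decodes back to è. The sketch below illustrates the same distinction with Python's standard json module rather than Scrapy itself; the sample record is taken from the question's output, and the variable names are only illustrative.

```python
import json

# Sample record taken from the output shown in the question.
record = {"numero": 4, "nom": "Salamèche"}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(record)

# ensure_ascii=False keeps non-ASCII characters as they are.
literal = json.dumps(record, ensure_ascii=False)

print(escaped)  # {"numero": 4, "nom": "Salam\u00e8che"}
print(literal)  # {"numero": 4, "nom": "Salamèche"}

# Both texts decode to exactly the same Python object.
assert json.loads(escaped) == json.loads(literal) == record
```

So the escaped file is already correct as data; changing the export encoding only matters if you want the file itself to be readable with the accented characters visible.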
