How to scrape dynamic web pages when bs4 and other Python libraries do not work?

Question

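I am scraping this website: https://www.eafo.eu/alternative-fuels/electricity/charging-infra-stats

I cannot extract the dynamic chart values with bs4 or Selenium. Both give me the HTML, but the data values are not in it. Is there a way to capture these values, or a more powerful tool that can handle dynamic web pages?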

Answer 1

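Yes, this is an interesting problem, and one that trips up a lot of people when scraping data from the web. The catch is that the charts are loaded after the JavaScript document-ready event fires. In essence, the charts are rendered only after all the HTML, CSS, and JS have loaded, and the data is bound to data attributes.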

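I created a code sample that uses a NodeJS Express server to return the data from all the charts as JSON. In essence, it hits the URL, targets the class the charts live in, and then looks for the data-* attributes that contain all of the chart data. That way, if you run into this situation with JavaScript-based chart rendering, you have working code to use and fork.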
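To illustrate the idea of targeting a class and reading its data-* attributes, here is a minimal Python sketch. It is not the code from the repo: it uses the standard library's html.parser instead of NodeJS, and the class name "chart", the attribute names data-title and data-rows, and the sample markup are all assumptions made up for the example. Note that it only helps when the data-* attributes are present in the HTML you actually received.

```python
import json
from html.parser import HTMLParser


class ChartDataParser(HTMLParser):
    """Collect data-* attributes from elements carrying a given CSS class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.charts = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if self.target_class not in classes:
            return
        data = {}
        for name, value in attr_map.items():
            if not name.startswith("data-"):
                continue
            key = name[len("data-"):]
            try:
                # Many chart libraries store JSON in data attributes.
                data[key] = json.loads(value)
            except (TypeError, ValueError):
                # Not JSON (or no value at all): keep the raw attribute value.
                data[key] = value
        self.charts.append(data)


def extract_chart_data(html, target_class="chart"):
    """Return one dict of data-* values per element with the target class."""
    parser = ChartDataParser(target_class)
    parser.feed(html)
    return parser.charts


# Hypothetical markup; the real page's class and attribute names may differ.
SAMPLE_HTML = """
<div class="chart wide" data-title="Fast Charging" data-rows="[[2019, 15136], [2020, 24987]]"></div>
<div class="legend" data-title="ignored"></div>
"""

charts = extract_chart_data(SAMPLE_HTML)
print(charts)
```

With real pages you would pass in the HTML you fetched and inspect the page source first to find the actual class and attribute names.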

GitHub repo with NodeJS and Python solutions: https://github.com/joehoeller/dynamic-chart-parser-for-webscraping

Answer 2

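Each of the six charts on the page is populated with data from an individual API call, which can be found under the Network tab of the browser's developer tools. You can send requests to these endpoints yourself and parse the responses: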

import json
import urllib.parse

import requests

# Request headers copied from the browser's Network tab.
headers = {
    'authority': 'www.eafo.eu',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'sec-ch-ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
    'accept': 'application/json, text/javascript, */*; q=0.01',
    'x-requested-with': 'XMLHttpRequest',
    'sec-ch-ua-mobile': '?0',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://www.eafo.eu/alternative-fuels/electricity/charging-infra-stats',
    'accept-language': 'en-US,en;q=0.9',
    'cookie': 'yearFilter=2020; activeSubMenu=electricity; subMenuActiveItem=charging_infra_stats; fuelFilter=Electricity; _ga=GA1.2.1782486955.1628797896; _gid=GA1.2.47726291.1628797896; _gat_gtag_UA_129775638_1=1',
}
params = (('compare', 'false'),)

# One endpoint per chart on the page.
urls = [
    'https://www.eafo.eu/normal-and-fast-charge-points/-1/-1/-1/false/false/nvt?compare=false',
    'https://www.eafo.eu/charging-positions-per-10-evs/-1/-1/-1/false/false/nvt?compare=false',
    'https://www.eafo.eu/normal-power-charging-positions/-1/-1/-1/false/false/nvt?compare=false',
    'https://www.eafo.eu/fillingstations-electricity-top-5/-1/-1/-1/false/false/nvt?compare=false',
    'https://www.eafo.eu/fast-charging/-1/-1/-1/false/false/nvt?compare=false',
    'https://www.eafo.eu/top-5-countries-charging-positions-per-10-evs/-1/-1/-1/false/false/nvt?compare=false',
]

# Pair each chart name (the first path segment of its URL) with the parsed JSON response.
data = [[urllib.parse.urlparse(url).path.split('/')[1], json.loads(requests.get(url, headers=headers, params=params).text)] for url in urls]

# Keep only the first two cell values (label, value) from every row.
result = {a: [[i['c'][0]['v'], i['c'][1]['v']] for i in b['data']['rows']] for a, b in data}

Output:

{'normal-and-fast-charge-points': [[2008, 0], [2009, 0], [2010, 0], [2011, 13], [2012, 257], [2013, 751], [2014, 1474], [2015, 3396], [2016, 5190], [2017, 8723], [2018, 11138], [2019, 15136], [2020, 24987]], 'charging-positions-per-10-evs': [['2008', 0], ['2009', 0], ['2010', '14'], ['2011', '6'], ['2012', '3'], ['2013', '4'], ['2014', '5'], ['2015', '5'], ['2016', '5'], ['2017', '5'], ['2018', '6'], ['2019', '7'], ['2020', '9']], 'normal-power-charging-positions': [['2008', 0], ['2009', 0], ['2010', 400], ['2011', 2379], ['2012', 10250], ['2013', 17093], ['2014', 24917], ['2015', 44786], ['2016', 70012], ['2017', 97287], ['2018', 107446], ['2019', 148880], ['2020', 199250]], 'fillingstations-electricity-top-5': [['Netherlands', 66461], ['France', 45413], ['Germany', 43633], ['Sweden', 13564], ['Italy', 13214]], 'fast-charging': [['2008', 0], ['2009', 0], ['2010', 0], ['2011', 13], ['2012', 257], ['2013', 751], ['2014', 1474], ['2015', 3396], ['2016', 5190], ['2017', 8723], ['2018', 11138], ['2019', 15136], ['2020', 24987]], 'top-5-countries-charging-positions-per-10-evs': [['Latvia', '3.15'], ['Slovakia', '4.34'], ['Croatia', '5.14'], ['Estonia', '5.31'], ['Netherlands', '5.71']]}
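The output above mixes types: some years and values come back as integers and others as strings such as '2010' or '3.15'. The following is a minimal sketch of one way to normalize them, assuming the response shape that the code above indexes into (data, rows, c, v). The helper names to_number and rows_to_pairs and the sample payload are hypothetical and written by hand for illustration, not taken from the site.

```python
def to_number(value):
    """Return int or float for numeric strings; leave anything else unchanged."""
    if isinstance(value, str):
        try:
            return int(value)
        except ValueError:
            try:
                return float(value)
            except ValueError:
                return value
    return value


def rows_to_pairs(payload):
    """Flatten {'data': {'rows': [{'c': [{'v': x}, {'v': y}]}]}} into [[x, y], ...]."""
    return [
        [to_number(cell["v"]) for cell in row["c"][:2]]
        for row in payload["data"]["rows"]
    ]


# Hand-written sample in the same shape as the endpoint responses (assumed).
sample_payload = {
    "data": {
        "rows": [
            {"c": [{"v": "2019"}, {"v": "7"}]},
            {"c": [{"v": "Latvia"}, {"v": "3.15"}]},
            {"c": [{"v": 2020}, {"v": 24987}]},
        ]
    }
}

pairs = rows_to_pairs(sample_payload)
print(pairs)
```

Labels such as country names pass through unchanged, so the same helper can be applied to every chart's rows.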

A more readable JSON format:

t = {' '.join(map(str.capitalize, a.split('-'))):b for a, b in result.items()}
print(json.dumps(t, indent=4))

Output:

{
    "Normal And Fast Charge Points": [
        [
            2008,
            0
        ],
        [
            2009,
            0
        ],
        [
            2010,
            0
        ],
        [
            2011,
            13
        ],
        [
            2012,
            257
        ],
        [
            2013,
            751
        ],
        [
            2014,
            1474
        ],
        [
            2015,
            3396
        ],
        [
            2016,
            5190
        ],
        [
            2017,
            8723
        ],
        [
            2018,
            11138
        ],
        [
            2019,
            15136
        ],
        [
            2020,
            24987
        ]
    ],
    "Charging Positions Per 10 Evs": [
        [
            "2008",
            0
        ],
        [
            "2009",
            0
        ],
        [
            "2010",
            "14"
        ],
        [
            "2011",
            "6"
        ],
        [
            "2012",
            "3"
        ],
        [
            "2013",
            "4"
        ],
        [
            "2014",
            "5"
        ],
        [
            "2015",
            "5"
        ],
        [
            "2016",
            "5"
        ],
        [
            "2017",
            "5"
        ],
        [
            "2018",
            "6"
        ],
        [
            "2019",
            "7"
        ],
        [
            "2020",
            "9"
        ]
    ],
    "Normal Power Charging Positions": [
        [
            "2008",
            0
        ],
        [
            "2009",
            0
        ],
        [
            "2010",
            400
        ],
        [
            "2011",
            2379
        ],
        [
            "2012",
            10250
        ],
        [
            "2013",
            17093
        ],
        [
            "2014",
            24917
        ],
        [
            "2015",
            44786
        ],
        [
            "2016",
            70012
        ],
        [
            "2017",
            97287
        ],
        [
            "2018",
            107446
        ],
        [
            "2019",
            148880
        ],
        [
            "2020",
            199250
        ]
    ],
    "Fillingstations Electricity Top 5": [
        [
            "Netherlands",
            66461
        ],
        [
            "France",
            45413
        ],
        [
            "Germany",
            43633
        ],
        [
            "Sweden",
            13564
        ],
        [
            "Italy",
            13214
        ]
    ],
    "Fast Charging": [
        [
            "2008",
            0
        ],
        [
            "2009",
            0
        ],
        [
            "2010",
            0
        ],
        [
            "2011",
            13
        ],
        [
            "2012",
            257
        ],
        [
            "2013",
            751
        ],
        [
            "2014",
            1474
        ],
        [
            "2015",
            3396
        ],
        [
            "2016",
            5190
        ],
        [
            "2017",
            8723
        ],
        [
            "2018",
            11138
        ],
        [
            "2019",
            15136
        ],
        [
            "2020",
            24987
        ]
    ],
    "Top 5 Countries Charging Positions Per 10 Evs": [
        [
            "Latvia",
            "3.15"
        ],
        [
            "Slovakia",
            "4.34"
        ],
        [
            "Croatia",
            "5.14"
        ],
        [
            "Estonia",
            "5.31"
        ],
        [
            "Netherlands",
            "5.71"
        ]
    ]
}

Tags: html, css, python, automation