How to scrape dynamic web pages when bs4 and other Python libraries do not work?
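I am scraping this website: https://www.eafo.eu/alternative-fuels/electricity/charging-infra-stats
I cannot extract the dynamic chart values with bs4 or Selenium. With either one I can get the HTML, but it contains none of the data values. Is there a way to capture this data, or a more powerful tool that can handle dynamic web pages?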
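Yes, this is an interesting problem, and one that trips up a lot of people when scraping data from the web. The charts are loaded only after the JavaScript document-ready event has fired. In other words, the charts are rendered once all the HTML, CSS, and JS have loaded, and the data is bound to data attributes.
I created a code sample that uses a NodeJS Express server to return the data from all the charts as JSON. Essentially, it hits the URL, targets the class that the charts live in, and then looks for the data-* attributes that contain all the chart data. That way, if you run into JavaScript-rendered charts like these, you have working code to use and fork.
GitHub repo with NodeJS and Python solutions: https://github.com/joehoeller/dynamic-chart-parser-for-webscraping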
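The "target the class, then read the data-* attributes" idea can be sketched in Python with only the standard library. This is not the code from the repo above. The class name `chart`, the attribute name `data-chart`, and the sample markup are all hypothetical, and the approach only works when the attributes are actually present in the HTML you parse (for example, page source captured after the scripts have run).

```python
from html.parser import HTMLParser
import json


class DataAttrParser(HTMLParser):
    """Collect data-* attributes from elements that carry a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.target_class in classes:
            # Keep only the data-* attributes of the matching element.
            data = {k: v for k, v in attrs.items() if k.startswith("data-")}
            if data:
                self.found.append(data)


def extract_chart_data(html, target_class="chart"):
    """Return one dict of data-* attributes per matching element."""
    parser = DataAttrParser(target_class)
    parser.feed(html)
    return parser.found


# Hypothetical markup: the real page's class and attribute names may differ.
sample_html = """
<div class="chart wide" data-chart='{"rows": [[2019, 15136], [2020, 24987]]}'></div>
<div class="sidebar" data-chart="ignored"></div>
"""

charts = extract_chart_data(sample_html)
rows = json.loads(charts[0]["data-chart"])["rows"]
print(rows)
```

Only the first `div` matches, because the second one does not carry the target class.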
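Each of the six charts on the page is populated with data from its own API call, which you can find under the Network tab of the browser's developer tools. You can send requests to these endpoints yourself and parse the responses: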
import json
import urllib.parse
import requests
# Headers copied from the browser's own request; the cookie and user-agent values are specific to that session.
headers = {'authority': 'www.eafo.eu', 'pragma': 'no-cache', 'cache-control': 'no-cache', 'sec-ch-ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"', 'accept': 'application/json, text/javascript, */*; q=0.01', 'x-requested-with': 'XMLHttpRequest', 'sec-ch-ua-mobile': '?0', 'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36', 'sec-fetch-site': 'same-origin', 'sec-fetch-mode': 'cors', 'sec-fetch-dest': 'empty', 'referer': 'https://www.eafo.eu/alternative-fuels/electricity/charging-infra-stats', 'accept-language': 'en-US,en;q=0.9', 'cookie': 'yearFilter=2020; activeSubMenu=electricity; subMenuActiveItem=charging_infra_stats; fuelFilter=Electricity; _ga=GA1.2.1782486955.1628797896; _gid=GA1.2.47726291.1628797896; _gat_gtag_UA_129775638_1=1'}
# Pass compare=false once through params instead of repeating it in every URL.
params = {'compare': 'false'}
urls = ['https://www.eafo.eu/normal-and-fast-charge-points/-1/-1/-1/false/false/nvt', 'https://www.eafo.eu/charging-positions-per-10-evs/-1/-1/-1/false/false/nvt', 'https://www.eafo.eu/normal-power-charging-positions/-1/-1/-1/false/false/nvt', 'https://www.eafo.eu/fillingstations-electricity-top-5/-1/-1/-1/false/false/nvt', 'https://www.eafo.eu/fast-charging/-1/-1/-1/false/false/nvt', 'https://www.eafo.eu/top-5-countries-charging-positions-per-10-evs/-1/-1/-1/false/false/nvt']
# Pair each chart name (the first path segment of the URL) with its parsed JSON response.
data = [[urllib.parse.urlparse(url).path.split('/')[1], requests.get(url, headers=headers, params=params).json()] for url in urls]
# Every row keeps its cells under 'c'; take the label and the value of each row.
result = {a: [[i['c'][0]['v'], i['c'][1]['v']] for i in b['data']['rows']] for a, b in data}
Output:
{'normal-and-fast-charge-points': [[2008, 0], [2009, 0], [2010, 0], [2011, 13], [2012, 257], [2013, 751], [2014, 1474], [2015, 3396], [2016, 5190], [2017, 8723], [2018, 11138], [2019, 15136], [2020, 24987]], 'charging-positions-per-10-evs': [['2008', 0], ['2009', 0], ['2010', '14'], ['2011', '6'], ['2012', '3'], ['2013', '4'], ['2014', '5'], ['2015', '5'], ['2016', '5'], ['2017', '5'], ['2018', '6'], ['2019', '7'], ['2020', '9']], 'normal-power-charging-positions': [['2008', 0], ['2009', 0], ['2010', 400], ['2011', 2379], ['2012', 10250], ['2013', 17093], ['2014', 24917], ['2015', 44786], ['2016', 70012], ['2017', 97287], ['2018', 107446], ['2019', 148880], ['2020', 199250]], 'fillingstations-electricity-top-5': [['Netherlands', 66461], ['France', 45413], ['Germany', 43633], ['Sweden', 13564], ['Italy', 13214]], 'fast-charging': [['2008', 0], ['2009', 0], ['2010', 0], ['2011', 13], ['2012', 257], ['2013', 751], ['2014', 1474], ['2015', 3396], ['2016', 5190], ['2017', 8723], ['2018', 11138], ['2019', 15136], ['2020', 24987]], 'top-5-countries-charging-positions-per-10-evs': [['Latvia', '3.15'], ['Slovakia', '4.34'], ['Croatia', '5.14'], ['Estonia', '5.31'], ['Netherlands', '5.71']]}
In a cleaner JSON format:
t = {' '.join(map(str.capitalize, a.split('-'))):b for a, b in result.items()}
print(json.dumps(t, indent=4))
Output:
{
"Normal And Fast Charge Points": [
[
2008,
0
],
[
2009,
0
],
[
2010,
0
],
[
2011,
13
],
[
2012,
257
],
[
2013,
751
],
[
2014,
1474
],
[
2015,
3396
],
[
2016,
5190
],
[
2017,
8723
],
[
2018,
11138
],
[
2019,
15136
],
[
2020,
24987
]
],
"Charging Positions Per 10 Evs": [
[
"2008",
0
],
[
"2009",
0
],
[
"2010",
"14"
],
[
"2011",
"6"
],
[
"2012",
"3"
],
[
"2013",
"4"
],
[
"2014",
"5"
],
[
"2015",
"5"
],
[
"2016",
"5"
],
[
"2017",
"5"
],
[
"2018",
"6"
],
[
"2019",
"7"
],
[
"2020",
"9"
]
],
"Normal Power Charging Positions": [
[
"2008",
0
],
[
"2009",
0
],
[
"2010",
400
],
[
"2011",
2379
],
[
"2012",
10250
],
[
"2013",
17093
],
[
"2014",
24917
],
[
"2015",
44786
],
[
"2016",
70012
],
[
"2017",
97287
],
[
"2018",
107446
],
[
"2019",
148880
],
[
"2020",
199250
]
],
"Fillingstations Electricity Top 5": [
[
"Netherlands",
66461
],
[
"France",
45413
],
[
"Germany",
43633
],
[
"Sweden",
13564
],
[
"Italy",
13214
]
],
"Fast Charging": [
[
"2008",
0
],
[
"2009",
0
],
[
"2010",
0
],
[
"2011",
13
],
[
"2012",
257
],
[
"2013",
751
],
[
"2014",
1474
],
[
"2015",
3396
],
[
"2016",
5190
],
[
"2017",
8723
],
[
"2018",
11138
],
[
"2019",
15136
],
[
"2020",
24987
]
],
"Top 5 Countries Charging Positions Per 10 Evs": [
[
"Latvia",
"3.15"
],
[
"Slovakia",
"4.34"
],
[
"Croatia",
"5.14"
],
[
"Estonia",
"5.31"
],
[
"Netherlands",
"5.71"
]
]
}
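The output above mixes integers with numeric strings (for example 2008 next to "2008", and "3.15"), which gets in the way of sorting or plotting. A minimal sketch of one way to normalize it follows. The helper names `to_number` and `normalize` are made up for illustration, and the sample dictionary is a small excerpt of the `result` shown above rather than a live response.

```python
def to_number(value):
    """Return an int or float for numeric strings; leave other values unchanged."""
    if isinstance(value, str):
        text = value.strip()
        try:
            return int(text)
        except ValueError:
            try:
                return float(text)
            except ValueError:
                return value  # a label such as a country name
    return value


def normalize(result):
    """Apply to_number to every cell of every row of every chart."""
    return {
        name: [[to_number(cell) for cell in row] for row in rows]
        for name, rows in result.items()
    }


# Excerpt of the scraped result, used here instead of a live request.
sample = {
    'charging-positions-per-10-evs': [['2008', 0], ['2010', '14']],
    'top-5-countries-charging-positions-per-10-evs': [['Latvia', '3.15']],
}

clean = normalize(sample)
print(clean)
```

Labels that are not numeric, such as country names, pass through unchanged, so the same function can be applied to all six charts.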