Scraping zip files from website in Python

Question

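I'm hoping someone can help me figure out how to scrape data from this page. I don't know where to start, as I've never done scraping or automated downloads in Python, but I'm just trying to find a way to automate downloading all the files on the linked page (and others like it; this one is just an example).

There is no discernible pattern in the file names linked; they appear to be random numbers that reference an ID-to-filename lookup table elsewhere.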

Answer 1

For the URL above, you can download the zip files with the following code:

import re
import requests
from bs4 import BeautifulSoup

hostname = "http://mis.ercot.com"
# Fetch the report listing page.
r = requests.get(f"{hostname}/misapp/GetReports.do?reportTypeId=13060&reportTitle=Historical%20DAM%20Load%20Zone%20and%20Hub%20Prices&showHTMLView=&mimicKey")
soup = BeautifulSoup(r.text, "html.parser")
# The download links all point at the mirDownload servlet.
regex = re.compile(r".*misdownload/servlets/mirDownload.*")
tags = soup.find_all("a", {"href": regex})
for link in tags:
    data = requests.get(f"{hostname}{link['href']}")
    # Name the file after the doclookupId value, minus its last character.
    filename = link["href"].split("doclookupId=")[1][:-1] + ".zip"
    with open(filename, "wb") as savezip:
        savezip.write(data.content)
    print(filename, "Saved")
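
The filename line above depends on `doclookupId=` being the last query parameter and trims one character off the end of the value. A minimal sketch of a more defensive alternative, using only the standard library, is shown here. The helper name `filename_from_href` and the sample href are hypothetical; the real links on the page may be shaped differently, so inspect one before relying on this.

```python
from urllib.parse import urlparse, parse_qs

def filename_from_href(href, default="download"):
    """Build a .zip filename from the doclookupId query parameter of a link."""
    query = parse_qs(urlparse(href).query)
    # parse_qs maps each parameter name to a list of values.
    doc_ids = query.get("doclookupId", [])
    doc_id = doc_ids[0].strip() if doc_ids else default
    return f"{doc_id}.zip"

# Hypothetical href, shaped like the links matched by the regex above.
print(filename_from_href("/misdownload/servlets/mirDownload?mimic_duns=&doclookupId=123456789"))
```

Note that this keeps the full ID rather than dropping its last character, so the resulting names can differ from those produced by the loop above.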

Let me know if you have any questions :)
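
Since the question also mentions other pages like this one, here is a rough sketch of building the listing URL from its parts instead of hard-coding it. The parameter names are taken from the URL in the code above; the function name `report_url` is hypothetical, and it is only an assumption that other report pages follow the same scheme with a different `reportTypeId` and `reportTitle`.

```python
from urllib.parse import urlencode, quote

HOSTNAME = "http://mis.ercot.com"

def report_url(report_type_id, report_title):
    """Build a GetReports.do URL for a given report type and title."""
    params = {
        "reportTypeId": report_type_id,
        "reportTitle": report_title,
        "showHTMLView": "",
        "mimicKey": "",
    }
    # quote_via=quote encodes spaces as %20, matching the URL in the answer.
    return f"{HOSTNAME}/misapp/GetReports.do?{urlencode(params, quote_via=quote)}"

print(report_url(13060, "Historical DAM Load Zone and Hub Prices"))
```

The result differs from the original URL only in ending with `mimicKey=` instead of `mimicKey`; whether the server treats the two the same has not been verified here.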

python

scrapy

web-scraping