Python 3: download the newest link found with Beautiful Soup
In my Python script I load a web page with Beautiful Soup. How can I download only the newest file?
<a href="BAGGEM0498L-15012021.zip">BAGGEM0498L-15012021.zip</a> 2021-01-19 06:56 3.6M
<a href="BAGGEM0498L-15022021.zip">BAGGEM0498L-15022021.zip</a> 2021-02-15 21:57 3.6M
<a href="BAGGEM0498L-15102020.zip">BAGGEM0498L-15102020.zip</a> 2020-10-24 03:19 3.6M
<a href="BAGGEM0498L-15112020.zip">BAGGEM0498L-15112020.zip</a> 2020-11-15 15:02 3.6M
<a href="BAGGEM0498L-15122020.zip">BAGGEM0498L-15122020.zip</a> 2020-12-15 13:48 3.6M
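The actual URL of the page is https://extracten.bag.kadaster.nl/lvbag/extracten/Gemeente LVC/0498/
If you are using the file name to decide the order, you first need to extract the date from it and convert it to a datetime
object. Build a list of [date, file name] pairs, then sort them by that date. For example: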
from bs4 import BeautifulSoup
from datetime import datetime
html = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
<head>
<title>Index of /lvbag/extracten/Gemeente LVC/0498</title>
</head>
<body>
<h1>Index of /lvbag/extracten/Gemeente LVC/0498</h1>
<pre> <a href="?C=N;O=D">Name</a> <a href="?C=M;O=A">Last modified</a> <a href="?C=S;O=A">Size</a> <hr> <a href="/lvbag/extracten/Gemeente%20LVC/">Parent Directory</a> -
<a href="BAGGEM0498L-15012021.zip">BAGGEM0498L-15012021.zip</a> 2021-01-19 06:56 3.6M
<a href="BAGGEM0498L-15022021.zip">BAGGEM0498L-15022021.zip</a> 2021-02-15 21:57 3.6M
<a href="BAGGEM0498L-15102020.zip">BAGGEM0498L-15102020.zip</a> 2020-10-24 03:19 3.6M
<a href="BAGGEM0498L-15112020.zip">BAGGEM0498L-15112020.zip</a> 2020-11-15 15:02 3.6M
<a href="BAGGEM0498L-15122020.zip">BAGGEM0498L-15122020.zip</a> 2020-12-15 13:48 3.6M
<hr></pre>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")
files = []
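for a in soup.find_all('a'):
    href = a['href']
    # Only the .zip links carry a date; skip the sorting and parent-directory links
    if href.endswith('.zip'):
        # File names look like BAGGEM0498L-15022021.zip, so the date part is DDMMYYYY
        date = datetime.strptime(href.split('.')[0].split('-')[1], '%d%m%Y')
        files.append([date, href])

# Newest first
files.sort(key=lambda x: x[0], reverse=True)
print("Latest:", files[0][1])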
This gives you:
Latest: BAGGEM0498L-15022021.zip
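The date extraction above assumes every .zip name contains exactly one dot and one hyphen. As a rough sketch of a slightly more defensive variant, the helpers below use a regular expression and skip names that do not fit. The names NAME_RE, date_from_name and latest_file are made up for this illustration, and the pattern assumes names end in -DDMMYYYY.zip as in the listing:

```python
import re
from datetime import datetime

# Assumed naming pattern: <prefix>-DDMMYYYY.zip, as in the directory listing
NAME_RE = re.compile(r"-(\d{8})\.zip$")


def date_from_name(href):
    """Return the date embedded in the file name, or None if it does not match."""
    match = NAME_RE.search(href)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%d%m%Y")
    except ValueError:
        # Eight digits that are not a valid calendar date
        return None


def latest_file(hrefs):
    """Return the href with the most recent embedded date, or None if none match."""
    dated = [(date_from_name(h), h) for h in hrefs]
    dated = [(d, h) for d, h in dated if d is not None]
    if not dated:
        return None
    return max(dated, key=lambda item: item[0])[1]


hrefs = [
    "?C=N;O=D",
    "/lvbag/extracten/Gemeente%20LVC/",
    "BAGGEM0498L-15012021.zip",
    "BAGGEM0498L-15022021.zip",
    "BAGGEM0498L-15102020.zip",
    "BAGGEM0498L-15112020.zip",
    "BAGGEM0498L-15122020.zip",
]
print("Latest:", latest_file(hrefs))
```

With Beautiful Soup, the list of hrefs could be built from soup.find_all('a') in the same way as in the loop above. Using max() instead of a full sort is enough when only the newest file is needed.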
The zip file can then be downloaded automatically as follows:
import requests
from bs4 import BeautifulSoup
from datetime import datetime
url = "https://extracten.bag.kadaster.nl/lvbag/extracten/Gemeente%20LVC/0498/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
files = []
for a in soup.find_all('a'):
    href = a['href']
    if href.endswith('.zip'):
        date = datetime.strptime(href.split('.')[0].split('-')[1], '%d%m%Y')
        files.append([date, href])

files.sort(key=lambda x: x[0], reverse=True)
filename = files[0][1]
print("Latest:", filename)

# Download the zip file (fetch first so a failed request leaves no empty file behind)
r_zip = requests.get(f'{url}{filename}')
r_zip.raise_for_status()

with open(filename, 'wb') as f_zip:
    f_zip.write(r_zip.content)
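Note that the page URL in the question contains a space ("Gemeente LVC"), while the script uses the percent-encoded form ("Gemeente%20LVC"). As a small sketch of how the download URL could be built from either form, the hypothetical helper download_url below uses only the standard library. It assumes the page URL ends with a slash, since urljoin would otherwise replace the last path segment:

```python
from urllib.parse import quote, urljoin


def download_url(page_url, href):
    """Build the absolute URL of a file linked from a directory listing page."""
    # Percent-encode spaces and similar characters; '%' stays untouched so an
    # already encoded URL is not encoded a second time.
    encoded_page = quote(page_url, safe=":/%")
    return urljoin(encoded_page, quote(href, safe="/%"))


page = "https://extracten.bag.kadaster.nl/lvbag/extracten/Gemeente LVC/0498/"
print(download_url(page, "BAGGEM0498L-15022021.zip"))
```

The result could be passed to requests.get() in place of the f-string used above.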