How to scrape download links inside each folder on S3
I have been scraping this dynamic website, which is essentially a file index. I want to collect the download links for the files inside every folder, all the way down to the deepest subfolder, but I don't know what mechanism I should apply to do this.
Code:
import time
import requests
from bs4 import BeautifulSoup  # the 'lxml' parser below also requires lxml to be installed
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = 'http://dl.ncsbe.gov.s3.amazonaws.com/index.html?prefix='

options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(5)  # wait for the JavaScript index page to render
page = driver.page_source
driver.quit()

soup = BeautifulSoup(page, 'html.parser')
lists = []
for tag in soup.find_all('a'):
    lists.append(tag['href'])

# the raw bucket listing, taken from the network tab of the browser dev tools (F12)
req = requests.get('https://s3.amazonaws.com/dl.ncsbe.gov?delimiter=/').content
soup = BeautifulSoup(req, 'lxml')
names = []
for common in soup.find_all('prefix')[2:]:
    names.append(common.text)
names.sort()
print(names)
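For reference, the same listing endpoint queried above also accepts a prefix parameter, so the contents of a single folder can be listed directly, without Selenium. A minimal sketch, assuming the Campaign_Finance/ prefix that this bucket returns:

import requests
from bs4 import BeautifulSoup

bucket = 'https://s3.amazonaws.com/dl.ncsbe.gov'

# list only the keys under one folder, e.g. Campaign_Finance/
resp = requests.get(bucket, params={'prefix': 'Campaign_Finance/', 'delimiter': '/'})
soup = BeautifulSoup(resp.content, 'lxml')

# <Key> elements are the files directly under that prefix
for key in soup.find_all('key'):
    print(f"{bucket}/{key.text}")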
All I want are the download links for every file type in each folder.
This is a public S3 bucket, so you can fetch the XML listing from the root folder:
https://s3.amazonaws.com/dl.ncsbe.gov/
That means you can take that response, parse the XML, and rebuild the URLs for all the keys.
Here's how:
import requests
import xmltodict

base_url = "https://s3.amazonaws.com/dl.ncsbe.gov"

# a plain GET on the bucket root returns the ListBucketResult XML
data = xmltodict.parse(requests.get(base_url).content)

valid_extensions = (
    ".pdf", ".doc", ".docx", ".txt", ".zip", ".xlsx", ".xls", ".csv", ".mp4",
)

for item in data["ListBucketResult"]["Contents"]:
    if item["Key"].endswith(valid_extensions):
        # keys come without a leading slash, so join with one
        s3_url = base_url + "/" if not item["Key"].startswith("/") else base_url
        print(f'{s3_url}{item["Key"].replace(" ", "%20")}')
This prints the entire structure of the bucket as file URLs:
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/2018%20County%20CF%20Procedures%20After%20the%20Election%20New%20Election%20Cycle%20Tasks.pdf
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/Audit%20Checklist.doc
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/Audit%20Letter%20-%20standard.docx
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/ICR-201%20Delinquent%20Repts.pdf
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/ICR-202%20Late%20Repts.pdf
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/ICR-203%20Noncompliant%20Comms.pdf
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/Prohibited%20Receipts-Expenditures.pdf
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/e-ICR-201.pdf
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/e-ICR-202.pdf
https://s3.amazonaws.com/dl.ncsbe.gov/Campaign_Finance/e-ICR-203.pdf
and many more ...
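One caveat: an unauthenticated GET on the bucket root is the S3 ListObjects call, which returns at most 1,000 keys per response. If the listing is truncated (IsTruncated is "true"), you have to page through it with the marker parameter. A sketch of that loop, reusing the same xmltodict approach:

import requests
import xmltodict

base_url = "https://s3.amazonaws.com/dl.ncsbe.gov"

keys = []
marker = ""
while True:
    page = xmltodict.parse(
        requests.get(base_url, params={"marker": marker}).content
    )["ListBucketResult"]
    contents = page.get("Contents", [])
    # xmltodict returns a dict, not a list, when a page has a single item
    if isinstance(contents, dict):
        contents = [contents]
    keys.extend(item["Key"] for item in contents)
    if page.get("IsTruncated") != "true":
        break
    marker = keys[-1]  # resume listing after the last key we saw

print(len(keys), "keys total")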