Why do I get an empty list when I try to scrape the links to the PDFs on this web page?
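I am trying to scrape the links to the PDFs on this web page. However, I get an empty list in return. Any help with this would be greatly appreciated.
Here is the code I used: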
import requests
from bs4 import BeautifulSoup
import lxml
import csv
url="https://occ.ca/our-publications/"
source=requests.get(url).text
soup=BeautifulSoup(source,'lxml')
match=soup.find_all('div')
print(match)
The page returns 403 (Forbidden) along with an error page. If you add a User-Agent header, it returns 200 (OK) and the page you need:
requests.get(url, headers={'User-Agent': 'Mozilla'})
Full example:
import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent; the default one gets a 403 from this site
response = requests.get('https://occ.ca/our-publications/', headers={'User-Agent': 'Mozilla'})
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    # Collect the divs with class "publicationoverlay" and take the first link in each
    pdfs = soup.find_all('div', {"class": "publicationoverlay"})
    links = [pdf.find('a').attrs['href'] for pdf in pdfs]
    print(links)
Output:
['https://occ.ca/wp-content/uploads/The-Great-Mosaic-Reviving-Ontarios-Regional-Economies.pdf', 'https://occ.ca/wp-content/uploads/OCC-Letter-in-support-of-the-OPG-Pickering-Nuclear-Nomination.pdf', 'https://occ.ca/wp-content/uploads/OCC-Beverage-Alcohol-Report.pdf', 'https://occ.ca/wp-content/uploads/Industrial-Electricity-Rates.pdf', 'https://occ.ca/wp-content/uploads/OCC-Letter_Strategic-Approach-to-Alcohol-Sales.pdf', 'https://occ.ca/wp-content/uploads/OCC-Submission-Modernizing-Ontarios-Environmental-Assessment-Program.pdf', 'https://occ.ca/wp-content/uploads/OCC-Letter-on-Ticket-Sales-Act.pdf', 'https://occ.ca/wp-content/uploads/2018-2019-Policy-Report-Card.pdf', 'https://occ.ca/wp-content/uploads/Letter-on-Right-to-Repair-May-1.pdf', 'https://occ.ca/wp-content/uploads/Federal-Carbon-Tax-Transparency-Act-2019-OCC.pdf', 'https://occ.ca/wp-content/uploads/Waste-and-Litter-Submission-_-Final.pdf', 'https://occ.ca/wp-content/uploads/Supporting-Ontarios-Budding-Cannabis-Industry.pdf']
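The list comprehension above raises an error if one of the matched divs has no link inside it. A slightly more defensive sketch of the same extraction step is shown below. It runs against a small hand-written HTML fragment rather than the live page; `SAMPLE_HTML` and `extract_pdf_links` are hypothetical names for illustration, and the fragment only assumes the `publicationoverlay` class used in the answer.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for the real page.
SAMPLE_HTML = """
<div class="publicationoverlay"><a href="/wp-content/uploads/report-a.pdf">A</a></div>
<div class="publicationoverlay"><span>no link here</span></div>
<div class="publicationoverlay"><a href="https://occ.ca/wp-content/uploads/report-b.pdf">B</a></div>
<div class="other"><a href="/about/">About</a></div>
"""


def extract_pdf_links(html, base_url):
    """Return absolute hrefs of the first link in each 'publicationoverlay' div."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for div in soup.find_all("div", class_="publicationoverlay"):
        anchor = div.find("a", href=True)  # None when the div has no link
        if anchor is not None:
            # Resolve relative hrefs against the page URL
            links.append(urljoin(base_url, anchor["href"]))
    return links


links = extract_pdf_links(SAMPLE_HTML, "https://occ.ca/our-publications/")
print(links)
```

Divs without a link are skipped instead of raising, and relative hrefs are turned into absolute URLs.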
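That is because your original request received a 403 Forbidden response.
By default, Python requests sends headers like the following (the Content-Length and Content-Type entries are added only when the request carries a body, so a plain GET sends just the first four):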
{
'User-Agent': 'python-requests/2.21.0',
'Accept-Encoding': 'gzip, deflate',
'Accept': '*/*',
'Connection': 'keep-alive',
'Content-Length': '40',
'Content-Type': 'application/json'
}
Some websites block requests with headers like these, which is why you received the 403 HTTP error.
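If you want to check which headers requests would send for a given call, you can prepare the request without sending it. This is a minimal sketch that makes no network connection; the exact User-Agent version and Accept-Encoding values depend on your installed packages.

```python
import requests

# Build the request through a Session so the library defaults are merged in,
# but do not send it: prepare_request() makes no network connection.
session = requests.Session()
request = requests.Request("GET", "https://occ.ca/our-publications/")
prepared = session.prepare_request(request)

# Show every header that would go out with this GET request
for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

The User-Agent line starts with `python-requests/`, which is the value some sites reject.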
source=requests.get(url, headers={'User-Agent': 'Mozilla'})
Adding this header solves the problem, and you get the content you want.
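If you make several requests to the same site, one option is to set the header once on a `requests.Session` and let `raise_for_status()` surface a 403 as an exception instead of silently parsing an error page. This is only a sketch: `make_session` and `fetch_html` are hypothetical helper names, and the `Mozilla/5.0` string and 10-second timeout are arbitrary choices.

```python
import requests


def make_session(user_agent="Mozilla/5.0"):
    """Return a Session that sends the given User-Agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_html(session, url, timeout=10):
    """Fetch a page and raise requests.HTTPError on 4xx/5xx responses."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # a 403 becomes an exception here
    return response.text


session = make_session()
print(session.headers["User-Agent"])
```

Calling `fetch_html(session, "https://occ.ca/our-publications/")` would then either return the page HTML or raise, so an empty result is never mistaken for a successful scrape.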