Why is it that when I try to scrape the links to the PDFs on this web page I just get an empty list in return?

Question

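I am trying to scrape the links to the PDFs on this web page. However, I just get an empty list in return. Any help with this problem would be greatly appreciated.

Here is the code I used: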

import requests
from bs4 import BeautifulSoup
import lxml
import csv
url="https://occ.ca/our-publications/"
source=requests.get(url).text
soup=BeautifulSoup(source,'lxml')
match=soup.find_all('div')
print(match)

Answer 1

The page returns 403 (Forbidden) along with an error page. If you add a User-Agent header, it returns 200 (OK) and the page you need:

requests.get(url, headers={'User-Agent': 'Mozilla'})
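
To make a blocked request fail loudly instead of silently parsing an error page, the call above can be wrapped in a small helper. This is only a sketch: the names fetch_html and BROWSER_HEADERS and the 10-second timeout are assumptions for illustration, not part of the original answer.

```python
import requests

# Hypothetical constant: the bare "Mozilla" User-Agent used in the answer above.
BROWSER_HEADERS = {"User-Agent": "Mozilla"}


def fetch_html(url, headers=None, timeout=10):
    """Return the page body as text, raising requests.HTTPError on 4xx/5xx."""
    response = requests.get(url, headers=headers or BROWSER_HEADERS, timeout=timeout)
    # raise_for_status() turns a 403 into an exception, so an error page
    # is never handed to the HTML parser by accident.
    response.raise_for_status()
    return response.text

# Usage (performs a real network request):
# html = fetch_html("https://occ.ca/our-publications/")
```

With this in place, a 403 shows up as an exception at the request step rather than as an unexpectedly empty result later on.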

Answer 2

See the code below:

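import requests
from bs4 import BeautifulSoup

# A browser-like User-Agent avoids the 403 response.
response = requests.get('https://occ.ca/our-publications/', headers={'User-Agent': 'Mozilla'})
if response.status_code == 200:
    # 'html.parser' is the parser bundled with Python; 'lxml' also works if installed.
    soup = BeautifulSoup(response.text, 'html.parser')
    # Each publication is wrapped in a <div class="publicationoverlay"> containing its link.
    pdfs = soup.find_all('div', {'class': 'publicationoverlay'})
    links = [pdf.find('a').attrs['href'] for pdf in pdfs]
    print(links)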

Output:

['https://occ.ca/wp-content/uploads/The-Great-Mosaic-Reviving-Ontarios-Regional-Economies.pdf', 'https://occ.ca/wp-content/uploads/OCC-Letter-in-support-of-the-OPG-Pickering-Nuclear-Nomination.pdf', 'https://occ.ca/wp-content/uploads/OCC-Beverage-Alcohol-Report.pdf', 'https://occ.ca/wp-content/uploads/Industrial-Electricity-Rates.pdf', 'https://occ.ca/wp-content/uploads/OCC-Letter_Strategic-Approach-to-Alcohol-Sales.pdf', 'https://occ.ca/wp-content/uploads/OCC-Submission-Modernizing-Ontarios-Environmental-Assessment-Program.pdf', 'https://occ.ca/wp-content/uploads/OCC-Letter-on-Ticket-Sales-Act.pdf', 'https://occ.ca/wp-content/uploads/2018-2019-Policy-Report-Card.pdf', 'https://occ.ca/wp-content/uploads/Letter-on-Right-to-Repair-May-1.pdf', 'https://occ.ca/wp-content/uploads/Federal-Carbon-Tax-Transparency-Act-2019-OCC.pdf', 'https://occ.ca/wp-content/uploads/Waste-and-Litter-Submission-_-Final.pdf', 'https://occ.ca/wp-content/uploads/Supporting-Ontarios-Budding-Cannabis-Industry.pdf']
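
One caveat with the list comprehension in this answer: pdf.find('a') returns None for a div without a link, which would raise an AttributeError. A slightly more defensive variant might look like the sketch below. The function name extract_pdf_links and the sample HTML are made up for illustration; only the publicationoverlay class comes from the answer.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_pdf_links(html, base_url):
    """Collect absolute .pdf URLs from anchors inside publicationoverlay divs."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # The CSS selector only matches anchors that actually carry an href,
    # so divs without a link are skipped instead of raising an error.
    for anchor in soup.select("div.publicationoverlay a[href]"):
        href = urljoin(base_url, anchor["href"])  # resolve relative URLs
        if href.lower().endswith(".pdf"):
            links.append(href)
    return links


# Made-up sample markup, not taken from the real page.
sample = """
<div class="publicationoverlay"><a href="/wp-content/uploads/Report.pdf">Report</a></div>
<div class="publicationoverlay"><span>No link here</span></div>
<div class="other"><a href="/about/">About</a></div>
"""
print(extract_pdf_links(sample, "https://occ.ca/our-publications/"))
# → ['https://occ.ca/wp-content/uploads/Report.pdf']
```

Keeping the parsing in a function that takes an HTML string also makes it easy to test without hitting the site.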

Answer 3

That is because your original request received a 403 Forbidden response. By default, Python's requests library adds headers like the following:

{
    'User-Agent': 'python-requests/2.21.0',
    'Accept-Encoding': 'gzip, deflate',
    'Accept': '*/*',
    'Connection': 'keep-alive',
    'Content-Length': '40',
    'Content-Type': 'application/json'
}

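Some websites block requests that carry headers like these, in particular the default python-requests User-Agent, which is why you received the 403 HTTP error. Send a browser-like User-Agent instead: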

source=requests.get(url, headers={'User-Agent': 'Mozilla'})

Adding this header solves the problem, and you will get the content you want.
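
To see what requests sends by default on your own installation, and to confirm that an override takes effect, the headers can be inspected without making any network call. This is a sketch; using a Session is an assumption here (the answer passes headers per request), chosen so the override applies to every request made through it.

```python
import requests

# Headers that requests attaches when none are supplied.
defaults = requests.utils.default_headers()
print(defaults["User-Agent"])  # of the form python-requests/<version>

# A Session applies the overridden User-Agent to every request it sends.
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla"})

# prepare_request() builds the outgoing request without sending it,
# which lets us check the final headers offline.
prepared = session.prepare_request(
    requests.Request("GET", "https://occ.ca/our-publications/")
)
print(prepared.headers["User-Agent"])
```

Calling session.get(url) afterwards would send the request with the same headers.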

python

beautifulsoup

web-crawler