BeautifulSoup get <cite> tags from Google
I'm making a Python script to search for a term on Google and fetch only the PDF links.
I'm trying to get the "green" search results that are marked with <cite> tags. They are not links, just titles.
Here's what I have so far:
from bs4 import BeautifulSoup
import requests
import re
url = "http://www.google.com/search?q=shakespeare+pdf"
get = requests.get(url).text
soup = BeautifulSoup(get)
pdf = re.compile(r"\.(pdf)")
cite_pdfs = soup.find_all(pdf, class_="_Rm")
print cite_pdfs
However, the list just returns [], i.e. nothing.
Here's a good implementation of it. I had to use a request with headers (hdr) via urllib2 in order to get past HTTP Error 403: Forbidden:
from BeautifulSoup import BeautifulSoup
import urllib2

site = "http://www.google.com/search?q=shakespeare+pdf"
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}

req = urllib2.Request(site, headers=hdr)
try:
    page = urllib2.urlopen(req).read()
    soup = BeautifulSoup(page)
    # '_Rm' is the class Google used for the green <cite> URL line
    ka = soup.findAll('cite', attrs={'class': '_Rm'})
    for i in ka:
        print i.text
except urllib2.HTTPError, e:
    print e.fp.read()
Here are the results:
davidlucking.com/documents/Shakespeare-Complete%20Works.pdf
www.artsvivants.ca/pdf/.../shakespeare_overvie...
www.folgerdigitaltexts.org/PDF/Ham.pdf
sparks.eserver.org/.../shakespeare-tempest.pdf
manybooks.net/.../shakespeetext94shaks12.htm...
www.w3.org/People/maxf/.../hamlet.pdf
www.adweek.com/.../free...shakespeare.../1868...
www.goodreads.com/ebooks/.../1420.Hamlet
calhoun.k12.il.us/teachers/wdeffenbaugh/.../Shakespeare%20Sonnets.pdf
www.freeclassicebooks.com/william_shakespea...
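For reference, here is a minimal sketch of the same header-based approach in Python 3, where urllib2's functionality moved to urllib.request and urllib.error, and BeautifulSoup is imported from bs4. Note that the '_Rm' class comes from an old Google layout and may no longer match anything:

from bs4 import BeautifulSoup
import urllib.request
import urllib.error

site = "http://www.google.com/search?q=shakespeare+pdf"
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}

req = urllib.request.Request(site, headers=hdr)
try:
    page = urllib.request.urlopen(req).read()
    soup = BeautifulSoup(page, 'html.parser')
    # '_Rm' was the class of the green <cite> line; it may have changed since
    for cite in soup.find_all('cite', attrs={'class': '_Rm'}):
        print(cite.text)
except urllib.error.HTTPError as e:
    print(e.read())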
You're looking for this:
for result in soup.select('.tF2Cxc'):
    if result.select_one('.ZGwO7'):
        pdf_file = result.select_one('.yuRUbf a')['href']
Also, it could be because there's no user-agent specified. Since the default requests user-agent is python-requests, Google blocks the request because it knows it's a bot and not a "real" user visit. Specifying a user-agent fakes a real user visit by adding this information to the HTTP request headers.
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "best lasagna recipe:pdf"
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
    # check whether a PDF is present via the corresponding CSS class
    if result.select_one('.ZGwO7'):
        pdf_file = result.select_one('.yuRUbf a')['href']
        print(pdf_file)
---------
'''
http://www.bakersedge.com/PDF/Lasagna.pdf
http://greatgreens.ca/recipes/Recipe%20-%20Worlds%20Best%20Lasagna.pdf
https://liparifoods.com/wp-content/uploads/2015/10/lipari-foods-holiday-recipes.pdf
...
'''
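A caveat: Google's class names (.tF2Cxc, .ZGwO7, .yuRUbf) change periodically, so a slightly more defensive sketch is to filter on the link itself instead of the PDF badge class (this assumes the soup object from the snippet above):

for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')
    # heuristic: match the URL suffix instead of the '.ZGwO7' badge class;
    # note that not every PDF URL actually ends in '.pdf'
    if link and link['href'].lower().endswith('.pdf'):
        print(link['href'])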
Alternatively, you can achieve the same thing using the Google Search Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that it lets you fetch the data quickly, instead of building a parser from scratch and maintaining it over time.
Code to integrate to achieve your goal:
from serpapi import GoogleSearch
import os

params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google",
    "q": "best lasagna recipe:pdf",
    "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    if '.pdf' in result['link']:
        pdf_file = result['link']
        print(pdf_file)
---------
'''
http://www.bakersedge.com/PDF/Lasagna.pdf
http://greatgreens.ca/recipes/Recipe%20-%20Worlds%20Best%20Lasagna.pdf
https://liparifoods.com/wp-content/uploads/2015/10/lipari-foods-holiday-recipes.pdf
...
'''
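If you need more than the first page of results, one possible pagination sketch passes Google's start offset through the same params dict (SerpApi forwards it to Google; 10 organic results per page is an assumption about the default layout):

# fetch the first three pages via Google's 'start' offset parameter
for offset in range(0, 30, 10):
    params["start"] = offset
    results = GoogleSearch(params).get_dict()
    for result in results.get('organic_results', []):
        if '.pdf' in result['link']:
            print(result['link'])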
P.S. - I wrote a more in-depth blog post about how to reduce the chance of being blocked while web scraping search engines.
Disclaimer, I work for SerpApi.