Cannot retrieve href for this particular page using beautiful soup
Here is my code:
# -*- coding: ascii -*-
# import libraries
from bs4 import BeautifulSoup
import urllib2
import re

def gethyperLinks(url):
    html_page = urllib2.urlopen(url)
    soup = BeautifulSoup(html_page, "html.parser")
    hyperlinks = []
    for link in soup.findAll('div', attrs={'class': 'ess-product-desc'}):
        hyperlinks.append(link.get('href'))
    return hyperlinks

print(gethyperLinks("http://biggestbook.com/ui/catalog.html#/search?cr=1&rs=12&st=BM&category=1"))
I want to target the following href:
<div
class="ess-product-desc" ng-hide="currentView == 'detail' `&& deviceType=='mobile'"
ui-sref="detail({itemId: 'BWK6400', uom: 'CT', cm_sp:'', merchPreference:''})"
href="#/itemDetail?`itemId=BWK6400&uom=CT" aria-hidden="false">
<span>Center-Pull Hand Towels, 2-Ply, Perforated, 7 7/8 x 10, White, 600/RL, 6 RL/CT</span>
</div>
I want to extract the href above, but I get [] as the final result. What am I doing wrong?
Perhaps you should use 'html5lib' instead of 'html.parser', like this:
from bs4 import BeautifulSoup
html="""
<div
class="ess-product-desc" ng-hide="currentView == 'detail' `&& deviceType=='mobile'"
ui-sref="detail({itemId: 'BWK6400', uom: 'CT', cm_sp:'', merchPreference:''})"
href="#/itemDetail?`itemId=BWK6400&uom=CT" aria-hidden="false">
<span>Center-Pull Hand Towels, 2-Ply, Perforated, 7 7/8 x 10, White, 600/RL, 6 RL/CT</span>
</div>
"""
soup = BeautifulSoup(html,"html5lib")
links = soup.findAll('div', attrs={'class': 'ess-product-desc'})
links[0].get("href")
You will get:
'#/itemDetail?`itemId=BWK6400&uom=CT'
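The values on the page require JavaScript to run. If you inspect the response (at least with requests), that should be clear. Here is an example using Selenium, which gives the JavaScript time to run. You can convert this into a function to use when returning data from pages you navigate to during a scraping session.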
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
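# Newer Selenium releases take the options object via options= (chrome_options= was deprecated and later removed).
driver = webdriver.Chrome(options=chrome_options)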
driver.get("http://biggestbook.com/ui/catalog.html#/search?cr=1&rs=12&st=BM&category=1")
links = WebDriverWait(driver,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".ess-product-brand + [href]")))
results = [link.get_attribute('href') for link in links]
print(results)
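One way to check the point about the raw response without a browser is to parse whatever HTML was actually received and look for the target element. The sketch below uses only Python 3's standard-library `html.parser`. The `ClassHrefCollector` and `hrefs_for_class` names are made up for illustration, `rendered` is a trimmed copy of the div from the question, and `unrendered` is a hypothetical stand-in for a JavaScript-driven page shell (the site's real response was not captured here).

```python
from html.parser import HTMLParser


class ClassHrefCollector(HTMLParser):
    """Collect the href of every start tag that carries a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # Attributes without a value (e.g. ng-view) arrive as None.
        classes = (attr_map.get("class") or "").split()
        if self.css_class in classes and attr_map.get("href") is not None:
            self.hrefs.append(attr_map["href"])


def hrefs_for_class(html, css_class):
    """Return the hrefs of all elements in `html` that have `css_class`."""
    parser = ClassHrefCollector(css_class)
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Trimmed version of the div from the question (what the browser shows
# after the page's JavaScript has run).
rendered = """
<div class="ess-product-desc"
     ui-sref="detail({itemId: 'BWK6400', uom: 'CT', cm_sp:'', merchPreference:''})"
     href="#/itemDetail?`itemId=BWK6400&uom=CT" aria-hidden="false">
  <span>Center-Pull Hand Towels</span>
</div>
"""

# Hypothetical shell: roughly what a plain HTTP fetch of a JavaScript-driven
# page might return before any script has populated it.
unrendered = '<html><body><div ng-view></div><script src="app.js"></script></body></html>'

rendered_hrefs = hrefs_for_class(rendered, "ess-product-desc")
unrendered_hrefs = hrefs_for_class(unrendered, "ess-product-desc")
print(rendered_hrefs)
print(unrendered_hrefs)
```

If the same check on the real response comes back empty, the element is simply not in the HTML that was downloaded, and switching parsers will not bring it back.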
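There is an API call, with query string parameters, that returns the data in JSON format. You have to pass a referer and a token. If you can obtain the token, or pass it within a session (while it is still valid), and can decipher the query string parameters, then that may be the way to take a requests-based approach. Not sure about urllib.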
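As a rough illustration of the request-based route: everything after `#` in the page URL is a fragment that the browser keeps to itself, so the search parameters never reach the server when the page is fetched directly. The sketch below (Python 3 standard library) pulls those parameters out of the URL from the question and attaches them to a request object. `API_URL`, the `X-Auth-Token` header name, and the token value are placeholders, not the site's real endpoint or authentication scheme, and it is an assumption that the API accepts the same parameter names; the real values would have to be read from the browser's network tab. The request is only built here, not sent.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit
from urllib.request import Request

PAGE_URL = "http://biggestbook.com/ui/catalog.html#/search?cr=1&rs=12&st=BM&category=1"
API_URL = "https://example.com/api/search"  # hypothetical placeholder endpoint
TOKEN = "replace-with-a-valid-token"        # hypothetical placeholder token

parts = urlsplit(PAGE_URL)

# The real query string is empty; the parameters live inside the fragment.
route, _, fragment_query = parts.fragment.partition("?")
params = parse_qsl(fragment_query)

# Browsers send the Referer without the fragment, so drop it here too.
referer = parts._replace(fragment="").geturl()

request = Request(
    API_URL + "?" + urlencode(params),
    headers={
        "Referer": referer,
        "X-Auth-Token": TOKEN,  # assumed header name, for illustration only
    },
)

print(repr(parts.query))
print(route, params)
print(request.full_url)
```

Passing the built request to `urllib.request.urlopen` (or sending the equivalent headers with requests) would be the next step once the real endpoint and token are known.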