BeautifulSoup - All href links don't appear to be extracting
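I'm trying to extract all of the href links inside the class ['address']. Every time I run the code I only get the first 5 and nothing more, even though I know there should be 9.
I've read the various threads below and modified my code countless times, including switching between all the parsers (html.parser, html5lib, lxml, xml, lxml-xml), but nothing seems to work. Any idea what is causing it to stop after the 5th iteration? I'm still very new to Python, so I apologize if this is a rookie mistake I've overlooked. Any help would be appreciated, even sarcastic answers :)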
Beautiful Soup findAll doesn't find them all
Beautiful Soup 4 find_all don't find links that Beautiful Soup 3 finds
BeautifulSoup fails to parse long view state
Beautifulsoup lost nodes
Missing parts on Beautiful Soup results
I've used very similar code on the following web pages and had no trouble scraping the hrefs:
https://www.walgreens.com/storelistings/storesbystate.jsp?requestType=locator
https://www.walgreens.com/storelistings/storesbycity.jsp?requestType=locator&state=AK
My code is below:
import requests
from bs4 import BeautifulSoup
local_rg = requests.get('https://www.walgreens.com/storelocator/find.jsp?requestType=locator&state=AK&city=ANCHORAGE&from=localSearch')
local_rg_content = local_rg.content
local_rg_content_src = BeautifulSoup(local_rg_content, 'lxml')
for link in local_rg_content_src.find_all('div'):
    local_class = str(link.get('class'))
    if str("['address']") in str(local_class):
        local_a = link.find_all('a')
        for a_link in local_a:
            local_href = str(a_link.get('href'))
            print(local_href)
My results (the first 5):
- /locator/walgreens-1470+w+northern+lights+blvd-anchorage-ak-99503/id=15092
- /locator/walgreens-725+e+northern+lights+blvd-anchorage-ak-99503/id=13656
- /locator/walgreens-4353+lake+otis+parkway-anchorage-ak-99508/id=15653
- /locator/walgreens-7600+debarr+rd-anchorage-ak-99504/id=12679
- /locator/walgreens-2197+w+dimond+blvd-anchorage-ak-99515/id=12680
But there should be 9:
- /locator/walgreens-1470+w+northern+lights+blvd-anchorage-ak-99503/id=15092
- /locator/walgreens-725+e+northern+lights+blvd-anchorage-ak-99503/id=13656
- /locator/walgreens-4353+lake+otis+parkway-anchorage-ak-99508/id=15653
- /locator/walgreens-7600+debarr+rd-anchorage-ak-99504/id=12679
- /locator/walgreens-2197+w+dimond+blvd-anchorage-ak-99515/id=12680
- /locator/walgreens-2550+e+88th+ave-anchorage-ak-99507/id=15654
- /locator/walgreens-12405+brandon+st-anchorage-ak-99515/id=13449
- /locator/walgreens-12051+old+glenn+hwy-eagle+river-ak-99577/id=15362
- /locator/walgreens-1721+e+parks+hwy-wasilla-ak-99654/id=12681
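One way to check whether the parser is at fault at all is to count the store links in the raw response text before BeautifulSoup touches it. The sketch below assumes every store link starts with `/locator/walgreens-`, as in the lists above; `count_store_links` and `fetch_and_count` are hypothetical helpers, not part of any library. If the raw text holds only five such links, the remaining four are most likely added by JavaScript after the page loads, and switching parsers will not recover them.

```python
import re

# Assumption: every store link begins with this prefix, as in the lists above.
STORE_LINK = re.compile(r'''href=["'](/locator/walgreens-[^"']+)["']''')


def count_store_links(html):
    """Return the number of distinct store links found in raw HTML text."""
    return len(set(STORE_LINK.findall(html)))


def fetch_and_count(url):
    """Download the page with requests and count the links in its raw text."""
    import requests  # third-party; imported lazily so the helper above is stdlib-only

    return count_store_links(requests.get(url).text)
```

Calling `fetch_and_count` with the `find.jsp` URL from the question and comparing the number with what the browser shows tells you whether the data is in the static HTML or arrives later.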
Try using selenium instead of requests to get the page's source code. Here's how you can do it:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.walgreens.com/storelocator/find.jsp?requestType=locator&state=AK&city=ANCHORAGE&from=localSearch')
local_rg_content = driver.page_source
driver.close()
local_rg_content_src = BeautifulSoup(local_rg_content, 'lxml')
The rest of the code stays the same. Here is the complete code:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.walgreens.com/storelocator/find.jsp?requestType=locator&state=AK&city=ANCHORAGE&from=localSearch')
local_rg_content = driver.page_source
driver.close()
local_rg_content_src = BeautifulSoup(local_rg_content, 'lxml')
for link in local_rg_content_src.find_all('div'):
    local_class = str(link.get('class'))
    if str("['address']") in str(local_class):
        local_a = link.find_all('a')
        for a_link in local_a:
            local_href = str(a_link.get('href'))
            print(local_href)
Output:
/locator/walgreens-1470+w+northern+lights+blvd-anchorage-ak-99503/id=15092
/locator/walgreens-725+e+northern+lights+blvd-anchorage-ak-99503/id=13656
/locator/walgreens-4353+lake+otis+parkway-anchorage-ak-99508/id=15653
/locator/walgreens-7600+debarr+rd-anchorage-ak-99504/id=12679
/locator/walgreens-2197+w+dimond+blvd-anchorage-ak-99515/id=12680
/locator/walgreens-2550+e+88th+ave-anchorage-ak-99507/id=15654
/locator/walgreens-12405+brandon+st-anchorage-ak-99515/id=13449
/locator/walgreens-12051+old+glenn+hwy-eagle+river-ak-99577/id=15362
/locator/walgreens-1721+e+parks+hwy-wasilla-ak-99654/id=12681
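`driver.page_source` is read as soon as `get` returns, so it can in principle be captured before the dynamically loaded results arrive. Below is a minimal sketch of a more defensive variant, assuming Selenium 4 and assuming the expected link count (9 here, taken from the question) is known in advance; `fetch_rendered_html` and `extract_store_links` are hypothetical names. It also swaps the string comparison on the class list for a CSS selector, which matches any div that has `address` among its classes rather than exactly `['address']`.

```python
from bs4 import BeautifulSoup


def extract_store_links(html):
    """Return the href of every <a> inside a <div class="address">."""
    soup = BeautifulSoup(html, 'html.parser')
    return [a['href'] for a in soup.select('div.address a[href]')]


def fetch_rendered_html(url, expected_links=9, timeout=15):
    """Load the page in Chrome and wait until enough links are rendered."""
    # Imported lazily so extract_store_links works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Assumption: the page is complete once `expected_links` links exist.
        WebDriverWait(driver, timeout).until(
            lambda d: len(d.find_elements(By.CSS_SELECTOR, 'div.address a'))
            >= expected_links
        )
        return driver.page_source
    finally:
        driver.quit()  # always release the browser, even on timeout
```

With these two helpers the main script reduces to `for href in extract_store_links(fetch_rendered_html(url)): print(href)`. If the count is not known beforehand, waiting for a fixed number is not appropriate and some other readiness condition would be needed.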
The page uses Ajax to load the store information from an external URL. You can use the requests/json modules to load it:
import re
import json
import requests
url = 'https://www.walgreens.com/storelocator/find.jsp?requestType=locator&state=AK&city=ANCHORAGE&from=localSearch'
ajax_url = 'https://www.walgreens.com/locator/v1/stores/search?requestor=search'
m = re.search(r'"lat":([\d.-]+),"lng":([\d.-]+)', requests.get(url).text)
params = {
    'lat': m.group(1),
    'lng': m.group(2)
}
data = requests.post(ajax_url, json=params).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for result in data['results']:
    print(result['store']['address']['street'])
    print('https://www.walgreens.com' + result['storeSeoUrl'])
    print('-' * 80)
Prints:
1470 W NORTHERN LIGHTS BLVD
https://www.walgreens.com/locator/walgreens-1470+w+northern+lights+blvd-anchorage-ak-99503/id=15092
--------------------------------------------------------------------------------
725 E NORTHERN LIGHTS BLVD
https://www.walgreens.com/locator/walgreens-725+e+northern+lights+blvd-anchorage-ak-99503/id=13656
--------------------------------------------------------------------------------
4353 LAKE OTIS PARKWAY
https://www.walgreens.com/locator/walgreens-4353+lake+otis+parkway-anchorage-ak-99508/id=15653
--------------------------------------------------------------------------------
7600 DEBARR RD
https://www.walgreens.com/locator/walgreens-7600+debarr+rd-anchorage-ak-99504/id=12679
--------------------------------------------------------------------------------
2197 W DIMOND BLVD
https://www.walgreens.com/locator/walgreens-2197+w+dimond+blvd-anchorage-ak-99515/id=12680
--------------------------------------------------------------------------------
2550 E 88TH AVE
https://www.walgreens.com/locator/walgreens-2550+e+88th+ave-anchorage-ak-99507/id=15654
--------------------------------------------------------------------------------
12405 BRANDON ST
https://www.walgreens.com/locator/walgreens-12405+brandon+st-anchorage-ak-99515/id=13449
--------------------------------------------------------------------------------
12051 OLD GLENN HWY
https://www.walgreens.com/locator/walgreens-12051+old+glenn+hwy-eagle+river-ak-99577/id=15362
--------------------------------------------------------------------------------
1721 E PARKS HWY
https://www.walgreens.com/locator/walgreens-1721+e+parks+hwy-wasilla-ak-99654/id=12681
--------------------------------------------------------------------------------
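The snippet above raises `AttributeError` if the page ever stops embedding the coordinates, because `re.search` then returns `None`. Here is a small sketch of a guarded version of that step, together with `urllib.parse.urljoin` for building the absolute URLs; `extract_coords` and `absolute_url` are hypothetical helper names, and the sample input in the last lines is made up for illustration.

```python
import re
from urllib.parse import urljoin

BASE_URL = 'https://www.walgreens.com'
# Same pattern as in the answer above.
COORDS = re.compile(r'"lat":([\d.-]+),"lng":([\d.-]+)')


def extract_coords(html):
    """Return {'lat': ..., 'lng': ...} as strings, or None if not found."""
    m = COORDS.search(html)
    if m is None:
        return None
    return {'lat': m.group(1), 'lng': m.group(2)}


def absolute_url(path):
    """Turn a site-relative path such as storeSeoUrl into a full URL."""
    return urljoin(BASE_URL, path)


# Made-up sample input, only to show the shape of the results.
print(extract_coords('{"lat":61.19,"lng":-149.9}'))
print(absolute_url('/locator/walgreens-1721+e+parks+hwy-wasilla-ak-99654/id=12681'))
```

The dictionary returned by `extract_coords` can be passed as `json=params` to the same `requests.post` call once it has been checked for `None`.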