Unable to identify link class
I'm new to programming and Python, and I'm trying to write this simple scraper to extract all the therapists' profile URLs from this page:
import requests
from bs4 import BeautifulSoup

def tru_crawler(max_pages):
    p = '&page='
    page = 1
    while page <= max_pages:
        url = 'http://www.therapy-directory.org.uk/search.php?search=Sheffield&distance=40&services[23]=on&services=23&business_type[individual]=on&uqs=626693' + p + str(page)
        code = requests.get(url)
        text = code.text
        soup = BeautifulSoup(text)
        for link in soup.findAll('a', {'member-summary': 'h2'}):
            href = 'http://www.therapy-directory.org.uk' + link.get('href')
            yield href + '\n'
            print(href)
        page += 1
Now when I run this code I get nothing back, mainly because soup.findAll comes up empty.
The HTML for a profile link looks like this:
<div class="member-summary">
<h2 class="">
<a href="/therapists/julia-church?uqs=626693">Julia Church</a>
</h2>
So I'm not sure what extra arguments to pass to soup.findAll('a') to get the profile URLs.
Please help.
Thanks
Update -
I ran the modified code; this time it returned a bunch of errors after scraping page 1:
Traceback (most recent call last):
File "C:/Users/PB/PycharmProjects/crawler/crawler-revised.py", line 19, in <module>
tru_crawler(3)
File "C:/Users/PB/PycharmProjects/crawler/crawler-revised.py", line 9, in tru_crawler
code = requests.get(url)
File "C:\Python27\lib\requests\api.py", line 68, in get
return request('get', url, **kwargs)
File "C:\Python27\lib\requests\api.py", line 50, in request
response = session.request(method=method, url=url, **kwargs)
File "C:\Python27\lib\requests\sessions.py", line 464, in request
resp = self.send(prep, **send_kwargs)
File "C:\Python27\lib\requests\sessions.py", line 602, in send
history = [resp for resp in gen] if allow_redirects else []
File "C:\Python27\lib\requests\sessions.py", line 195, in resolve_redirects
allow_redirects=False,
File "C:\Python27\lib\requests\sessions.py", line 576, in send
r = adapter.send(request, **kwargs)
File "C:\Python27\lib\requests\adapters.py", line 415, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', BadStatusLine("''",))
What is going wrong here?
The findAll() arguments you currently have make no sense. They say: find all <a> tags that have a member-summary attribute equal to "h2" — no such attribute exists, so nothing matches.
One possible approach is to use the select() method, passing a CSS selector as the argument:
for link in soup.select('div.member-summary h2 a'):
    href = 'http://www.therapy-directory.org.uk' + link.get('href')
    yield href + '\n'
    print(href)
The CSS selector above says: find the <div> tags whose class equals "member-summary", then inside each such <div> find the <h2> tags, then inside each <h2> find the <a> tags.
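If you prefer to stay with findAll()/find_all(), the same lookup can be written by matching the div on its class and then descending to the link. Here is a minimal sketch using the HTML snippet from the question (the sample markup is taken verbatim from the post):

```python
from bs4 import BeautifulSoup

html = '''
<div class="member-summary">
<h2 class="">
<a href="/therapists/julia-church?uqs=626693">Julia Church</a>
</h2>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
# Match each member-summary div on its class, then take the <a> inside its <h2>
links = [div.h2.a.get('href')
         for div in soup.find_all('div', class_='member-summary')]
print(links)  # ['/therapists/julia-church?uqs=626693']
```

The class_ keyword (with the trailing underscore, since class is a Python keyword) is how find_all filters on the class attribute; the original {'member-summary': 'h2'} dict filtered on a nonexistent attribute instead.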
Working example:
import requests
from bs4 import BeautifulSoup

p = '&page='
page = 1
url = 'http://www.therapy-directory.org.uk/search.php?search=Sheffield&distance=40&services[23]=on&services=23&business_type[individual]=on&uqs=626693' + p + str(page)
code = requests.get(url)
text = code.text
soup = BeautifulSoup(text)
for link in soup.select('div.member-summary h2 a'):
    href = 'http://www.therapy-directory.org.uk' + link.get('href')
    print(href)
Output (trimmed; 26 links in total):
http://www.therapy-directory.org.uk/therapists/lesley-lister?uqs=626693
http://www.therapy-directory.org.uk/therapists/fiona-jeffrey?uqs=626693
http://www.therapy-directory.org.uk/therapists/ann-grant?uqs=626693
.....
.....
http://www.therapy-directory.org.uk/therapists/jan-garbutt?uqs=626693
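As for the ConnectionError in your update: a common cause of aborted connections like this is the server rejecting clients that send no browser-like User-Agent, or being hit too quickly across pages. That diagnosis is an assumption — the traceback alone doesn't confirm the cause — but sending a header and pacing the requests is a standard mitigation. A paginated sketch combining it with the select() fix:

```python
import time
import requests
from bs4 import BeautifulSoup

BASE = 'http://www.therapy-directory.org.uk'
SEARCH = (BASE + '/search.php?search=Sheffield&distance=40'
          '&services[23]=on&services=23'
          '&business_type[individual]=on&uqs=626693')

def build_url(page):
    # Append the page number to the fixed search query
    return SEARCH + '&page=' + str(page)

def tru_crawler(max_pages):
    # Browser-like header; some servers drop bare requests clients,
    # which can surface as BadStatusLine / ConnectionError
    headers = {'User-Agent': 'Mozilla/5.0'}
    for page in range(1, max_pages + 1):
        resp = requests.get(build_url(page), headers=headers)
        soup = BeautifulSoup(resp.text, 'html.parser')
        for link in soup.select('div.member-summary h2 a'):
            yield BASE + link.get('href')
        time.sleep(1)  # pause between pages to avoid hammering the server

if __name__ == '__main__':
    for href in tru_crawler(3):
        print(href)
```

Note the generator is consumed with a for loop here; calling tru_crawler(3) on its own (as in your traceback's line 19) only creates the generator without fetching anything.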