BeautifulSoup Iteration not working
from bs4 import BeautifulSoup
import requests

s = requests.Session()
r = s.get('http://www.virginiaequestrian.com/main.cfm?action=greenpages&GPType=8')
soup = BeautifulSoup(r.text, 'html5lib')
DataGrid = soup.find('tbody')
test = []
for tr in DataGrid.find_all('tr')[:3]:
    for td in tr.find_all('td'):
        print(td.string)
Hi, I'm trying to parse the HTML of this site (http://www.virginiaequestrian.com/main.cfm?action=greenpages&GPType=8) and pull the table data. I'm trying to exclude the first three table rows from my results, but for some reason I can't get the parser to do it. This is my first serious scraping attempt and I'm completely at a loss as to how to make it work. I'm guessing it may have something to do with the html5lib parser I'm using, but honestly I don't know. Can someone show me how to get this working?
As a good test, extracting the data minus the first three rows would be very useful, so I can be confident the finished query will pull everything except those.
For example, the first row in the table should be 'Equestrian Web Sites'.
You are taking only the first three rows, not ignoring them. [:3] slices the first three elements out of the list:
DataGrid.find_all('tr')[:3]  # first three elements
It should be DataGrid.find_all('tr')[3:]
# everything except the first three elements
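A quick sanity check of the two slices on a plain list (hypothetical data, not taken from the page):

```python
rows = ["header1", "header2", "header3", "row1", "row2"]

# [:3] keeps the first three elements
print(rows[:3])   # ['header1', 'header2', 'header3']

# [3:] drops the first three and keeps the rest
print(rows[3:])   # ['row1', 'row2']
```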
from bs4 import BeautifulSoup
import requests

r = requests.get('http://www.virginiaequestrian.com/main.cfm?action=greenpages&GPType=8')
soup = BeautifulSoup(r.content, 'html.parser')
tbl = soup.find("table")
for tag in tbl.find_all("tr")[3:]:
    for td in tag.find_all('td'):
        print(td.text)
The tbl.find_all("tr") above, sliced and counted with two different parsers:
In [20]: soup=BeautifulSoup(r.content,"html.parser")
In [21]: tbl = soup.find("table")
In [22]: len(tbl.find_all("tr"))
Out[22]: 364
In [23]: len(tbl.find_all("tr")[3:])
Out[23]: 361
In [24]: soup=BeautifulSoup(r.content,"lxml")
In [25]: tbl = soup.find("table")
In [26]: len(tbl.find_all("tr")[3:])
Out[26]: 361
In [27]: len(tbl.find_all("tr"))
Out[27]: 364
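The counts above line up with how slicing works in general: dropping the first three of 364 rows leaves 361. A quick check on a plain list standing in for the tr tags:

```python
lst = list(range(364))   # stand-in for the 364 tr tags
print(len(lst))          # 364
print(len(lst[3:]))      # 361
```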
If what you actually want are the hrefs, then you should do it like this, getting the a tag from each tr. There are 6 trs before the rows you actually want, so you need to skip 6:
tbl = soup.find("table")
out = (tag.find('a') for tag in tbl.find_all("tr")[6:])
for a in out:
print(a["href"])
Output:
main.cfm?action=greenpages&sub=view&ID=9068
main.cfm?action=greenpages&sub=view&ID=9504
main.cfm?action=greenpages&sub=view&ID=10868
main.cfm?action=greenpages&sub=view&ID=10261
main.cfm?action=greenpages&sub=view&ID=10477
main.cfm?action=greenpages&sub=view&ID=10708
main.cfm?action=greenpages&sub=view&ID=11712
main.cfm?action=greenpages&sub=view&ID=12402
main.cfm?action=greenpages&sub=view&ID=12496
..................
To use the links, just prepend the base url. Note that out is a generator, so it is exhausted after one pass and has to be recreated here:
out = (tag.find('a') for tag in tbl.find_all("tr")[6:])
for a in out:
    print("http://www.virginiaequestrian.com/{}".format(a["href"]))
Output:
http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=9068
http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=9504
http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=10868
http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=10261
http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=10477
http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=10708
http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=11712
http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=12402
http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=12496
http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=12633
http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=13528
If you open the first one, it takes you to Equestrian Web Sites, i.e. the first piece of data you want.
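As an aside, instead of string formatting you could build the absolute links with the standard library's urllib.parse.urljoin, which handles the separator for you; a minimal sketch using one of the hrefs above:

```python
from urllib.parse import urljoin

base = "http://www.virginiaequestrian.com/"
href = "main.cfm?action=greenpages&sub=view&ID=9068"

# Resolve the relative href against the site's base URL
print(urljoin(base, href))
# http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=9068
```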