I am not able to get the full data using BeautifulSoup4
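I am trying to scrape this website as a simple learning exercise. I just want to print all the products on the page using find_all(). There are 12 products, each in a tbody tag with the class product-variants-list, but I only get five and I can't find the problem.
My code: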
from bs4 import BeautifulSoup
import urllib2

url = 'http://www.zooplus.co.uk/shop/dogs/dry_dog_food/royal_canin_vet_diet'
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html, "lxml")
product_list = soup.find_all("tbody", {"class": "product-variants-list"})
i = 0
for product in product_list:
    product_name = product.find("a", {"class": "follow3"}).find("b").text
    print i, product_name
    #product_variants = product.find_all("tr",{"class":"product-variant"})
    i += 1
The HTML is:
<table id="product-list" width="658" cellspacing="0" cellpadding="2" border="0">
<tbody class="products-header"></tbody>
<tbody class="product-variants-list">
<tr></tr>
<tr class="text" style="background-color:#ffffff;">
<td valign="middle" colspan="6">
<a class="follow3" title="Royal Canin Veterinary Diet - Hypoallergenic DR 21" href="/shop/dogs/dry_dog_food/royal_canin_vet_diet/307309">
<b>
Royal Canin Veterinary Diet - Hypoallergenic DR 21
</b>
::after
</a>
</td>
</tr>
<tr class="text" style="background-color:#ffffff;"></tr>
<tr class="text product-variant"></tr>
<tr class="text product-variant"></tr>
<tr class="text product-variant"></tr>
</tbody>
<tbody class="product-adzone"></tbody>
<tbody class="product-variants-list"></tbody>
<tbody class="product-adzone"></tbody>
<tbody class="product-variants-list"></tbody>
<tbody class="product-adzone"></tbody>
<tbody class="product-variants-list"></tbody>
<tbody class="product-adzone"></tbody>
<tbody class="product-variants-list"></tbody>
<tbody class="product-adzone"></tbody>
<tbody class="product-variants-list"></tbody>
<tbody class="product-adzone"></tbody>
<tbody class="product-variants-list"></tbody>
<tbody class="product-adzone"></tbody>
<tbody class="product-variants-list"></tbody>
<tbody class="product-adzone"></tbody>
<tbody class="product-variants-list"></tbody>
<tbody class="product-adzone"></tbody>
<tbody class="product-variants-list"></tbody>
<tbody class="product-adzone"></tbody>
<tbody class="product-variants-list"></tbody>
<tbody class="product-adzone"></tbody>
<tbody class="product-variants-list"></tbody>
<tbody class="product-adzone"></tbody>
<tbody class="products-footer"></tbody>
</table>
My output:
0 Royal Canin Veterinary Diet - Hypoallergenic DR 21
1 Royal Canin Veterinary Diet - Sensitivity Control SC 21
2 Royal Canin Veterinary Diet - Gastro Intestinal GI 25
3 Royal Canin Veterinary Diet - Renal RF 14
4 Royal Canin Veterinary Diet - Obesity Management DP 34
I think the problem is this line:
soup = BeautifulSoup(html,"lxml")
If you change "lxml" to "html.parser", it will work.
Here is the complete code:
from bs4 import BeautifulSoup
import urllib2

url = 'http://www.zooplus.co.uk/shop/dogs/dry_dog_food/royal_canin_vet_diet'
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html, "html.parser")
product_list = soup.find_all("tbody", {"class": "product-variants-list"})
i = 0
for product in product_list:
    product_name = product.find("a", {"class": "follow3"}).find("b").text
    print i, product_name
    #product_variants = product.find_all("tr",{"class":"product-variant"})
    i += 1
The result is:
0 Royal Canin Veterinary Diet - Hypoallergenic DR 21
1 Royal Canin Veterinary Diet - Sensitivity Control SC 21
2 Royal Canin Veterinary Diet - Gastro Intestinal GI 25
3 Royal Canin Veterinary Diet - Renal RF 14
4 Royal Canin Veterinary Diet - Obesity Management DP 34
5 Royal Canin Veterinary Diet - Urinary S/O LP 18
6 Royal Canin Veterinary Diet - Mobility MS 25
7 Royal Canin Veterinary Diet - Satiety Support SAT 30
8 Royal Canin Veterinary Diet - Hepatic HF 16
9 Royal Canin Veterinary Diet - Dental DLK 22
10 Royal Canin Veterinary Diet - Diabetic DS 37
11 Royal Canin Veterinary Diet - Calm CD 25
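The chained product.find(...).find("b").text call above raises AttributeError as soon as one block has no matching link. Below is a minimal sketch of a more defensive version, written for Python 3 rather than the Python 2 used above. The function name product_names and the SAMPLE markup are made up for illustration; they only mimic the structure quoted in the question and are not the real page.

```python
from bs4 import BeautifulSoup


def product_names(html, parser="html.parser"):
    """Collect the product name from each tbody.product-variants-list block."""
    soup = BeautifulSoup(html, parser)
    names = []
    for block in soup.find_all("tbody", class_="product-variants-list"):
        link = block.find("a", class_="follow3")
        bold = link.find("b") if link is not None else None
        if bold is None:
            # Block without a name link: skip it instead of raising AttributeError.
            continue
        names.append(bold.get_text(strip=True))
    return names


# Hypothetical sample that imitates the table layout from the question.
SAMPLE = """
<table id="product-list">
<tbody class="product-variants-list">
<tr><td><a class="follow3" href="#"><b>
Product A
</b></a></td></tr>
</tbody>
<tbody class="product-adzone"></tbody>
<tbody class="product-variants-list"></tbody>
<tbody class="product-variants-list">
<tr><td><a class="follow3" href="#"><b>Product B</b></a></td></tr>
</tbody>
</table>
"""

if __name__ == "__main__":
    for i, name in enumerate(product_names(SAMPLE)):
        print(i, name)
```

Passing the parser name as an argument makes it easy to run the same extraction with "lxml" and "html.parser" and compare how many names each one finds.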
Hope this helps!
Have a nice day.
** UPDATE **
The differences between the lxml and html.parser parsers are explained well here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers
If the document is not perfectly-formed, different parsers will give different results. Here’s a short, invalid document parsed using lxml’s HTML parser. Note that the dangling </p> tag is simply ignored:
Result with lxml:
BeautifulSoup("<a></p>", "lxml")
<html><body><a></a></body></html>
Here’s the same document parsed using html5lib:
Result with html5lib:
BeautifulSoup("<a></p>", "html5lib")
<html><head></head><body><a><p></p></a></body></html>
Result with html.parser:
BeautifulSoup("<a></p>", "html.parser")
<a></a>
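To see the differences on your own machine, the same fragment can be run through every parser that happens to be installed. This is a rough Python 3 sketch; the helper name parse_with_each_parser is made up for illustration, and it assumes that lxml and html5lib may be missing, in which case they are skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_each_parser(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Return {parser_name: serialized tree} for every parser that is installed."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages; skip if absent.
            continue
    return results


if __name__ == "__main__":
    for name, tree in parse_with_each_parser("<a></p>").items():
        print(name, "->", tree)
```

Feeding the downloaded page to this helper instead of the short fragment is one way to check whether a missing element comes from the parser's repair of broken markup rather than from the find_all() call.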