I am not able to get the full data using BeautifulSoup4

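I am trying to scrape this website as a simple learning exercise. I just want to print all the products on the page using find_all(). There are 12 products, each in a tbody tag with class product-variants-list, but I only get five, and I can't find the problem.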

My code:

from bs4 import BeautifulSoup
import urllib2

url = 'http://www.zooplus.co.uk/shop/dogs/dry_dog_food/royal_canin_vet_diet'
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html,"lxml")

product_list = soup.find_all("tbody", {"class":"product-variants-list"})

i=0

for product in product_list:

    product_name = product.find("a",{"class":"follow3"}).find("b").text
    print i, product_name
    #product_variants = product.find_all("tr",{"class":"product-variant"})
    i +=1

The HTML is:

<table id="product-list" width="658" cellspacing="0" cellpadding="2" border="0">

    <tbody class="products-header"></tbody>
    <tbody class="product-variants-list">
        <tr></tr>
        <tr class="text" style="background-color:#ffffff;">
            <td valign="middle" colspan="6">
                <a class="follow3" title="Royal Canin Veterinary Diet - Hypoallergenic DR 21" href="/shop/dogs/dry_dog_food/royal_canin_vet_diet/307309">
                    <b>

                        Royal Canin Veterinary Diet - Hypoallergenic DR 21

                    </b>
                    ::after
                </a>
            </td>
        </tr>
        <tr class="text" style="background-color:#ffffff;"></tr>
        <tr class="text product-variant"></tr>
        <tr class="text product-variant"></tr>
        <tr class="text product-variant"></tr>
    </tbody>
    <tbody class="product-adzone"></tbody>
    <tbody class="product-variants-list"></tbody>
    <tbody class="product-adzone"></tbody>
    <tbody class="product-variants-list"></tbody>
    <tbody class="product-adzone"></tbody>
    <tbody class="product-variants-list"></tbody>
    <tbody class="product-adzone"></tbody>
    <tbody class="product-variants-list"></tbody>
    <tbody class="product-adzone"></tbody>
    <tbody class="product-variants-list"></tbody>
    <tbody class="product-adzone"></tbody>
    <tbody class="product-variants-list"></tbody>
    <tbody class="product-adzone"></tbody>
    <tbody class="product-variants-list"></tbody>
    <tbody class="product-adzone"></tbody>
    <tbody class="product-variants-list"></tbody>
    <tbody class="product-adzone"></tbody>
    <tbody class="product-variants-list"></tbody>
    <tbody class="product-adzone"></tbody>
    <tbody class="product-variants-list"></tbody>
    <tbody class="product-adzone"></tbody>
    <tbody class="product-variants-list"></tbody>
    <tbody class="product-adzone"></tbody>
    <tbody class="products-footer"></tbody>

</table>

My output:

0 Royal Canin Veterinary Diet - Hypoallergenic DR 21
1 Royal Canin Veterinary Diet - Sensitivity Control SC 21
2 Royal Canin Veterinary Diet - Gastro Intestinal GI 25
3 Royal Canin Veterinary Diet - Renal RF 14
4 Royal Canin Veterinary Diet - Obesity Management DP 34

I think your mistake is in this line:

soup = BeautifulSoup(html,"lxml")

If you change "lxml" to "html.parser", it will work.
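If you want to check that it is the parser, and not find_all(), that loses the rows, one rough test is to compare how often the class attribute appears in the raw markup with how many elements end up in the parsed tree. This is only a sketch in Python 3 syntax: the helper name count_raw_vs_parsed and the sample markup are made up for illustration, and the raw count is a plain substring match, so it assumes the attribute is written exactly as class="product-variants-list".

```python
from bs4 import BeautifulSoup


def count_raw_vs_parsed(html, css_class, parser="html.parser"):
    """Return (raw_count, parsed_count) for <tbody> elements with css_class.

    html must be text (str); decode the bytes from response.read() first.
    """
    # Substring match on the raw markup: a rough heuristic, not a real parse.
    raw_count = html.count('class="%s"' % css_class)
    soup = BeautifulSoup(html, parser)
    parsed_count = len(soup.find_all("tbody", class_=css_class))
    return raw_count, parsed_count


# Hypothetical, well-formed stand-in for the downloaded page.
sample = (
    '<table>'
    '<tbody class="product-variants-list"><tr><td>A</td></tr></tbody>'
    '<tbody class="product-adzone"></tbody>'
    '<tbody class="product-variants-list"><tr><td>B</td></tr></tbody>'
    '</table>'
)

raw, parsed = count_raw_vs_parsed(sample, "product-variants-list")
print(raw, parsed)
```

If the two numbers differ for one parser but not for another, that suggests the first parser is dropping or restructuring part of the document.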

Here is the complete code using html.parser:

from bs4 import BeautifulSoup
import urllib2

url = 'http://www.zooplus.co.uk/shop/dogs/dry_dog_food/royal_canin_vet_diet'
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html,"html.parser")

product_list = soup.find_all("tbody", {"class":"product-variants-list"})

i=0

for product in product_list:

    product_name = product.find("a",{"class":"follow3"}).find("b").text
    print i, product_name
    #product_variants = product.find_all("tr",{"class":"product-variant"})
    i +=1

The result is:

0 Royal Canin Veterinary Diet - Hypoallergenic DR 21
1 Royal Canin Veterinary Diet - Sensitivity Control SC 21
2 Royal Canin Veterinary Diet - Gastro Intestinal GI 25
3 Royal Canin Veterinary Diet - Renal RF 14
4 Royal Canin Veterinary Diet - Obesity Management DP 34
5 Royal Canin Veterinary Diet - Urinary S/O LP 18
6 Royal Canin Veterinary Diet - Mobility MS 25
7 Royal Canin Veterinary Diet - Satiety Support SAT 30
8 Royal Canin Veterinary Diet - Hepatic HF 16
9 Royal Canin Veterinary Diet - Dental DLK 22
10 Royal Canin Veterinary Diet - Diabetic DS 37
11 Royal Canin Veterinary Diet - Calm CD 25
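The code above is Python 2 (urllib2 and the print statement). On Python 3 the same idea could look roughly like the sketch below. The function names extract_product_names and fetch_product_names are my own, the None checks are an extra guard I added for tbody blocks that have no follow3 link, and the page layout may well have changed since this was written.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup

URL = "http://www.zooplus.co.uk/shop/dogs/dry_dog_food/royal_canin_vet_diet"


def extract_product_names(html, parser="html.parser"):
    """Return the product names found in the given HTML (str or bytes)."""
    soup = BeautifulSoup(html, parser)
    names = []
    for product in soup.find_all("tbody", class_="product-variants-list"):
        link = product.find("a", class_="follow3")
        bold = link.find("b") if link is not None else None
        if bold is not None:
            # strip=True removes the whitespace around the name inside <b>.
            names.append(bold.get_text(strip=True))
    return names


def fetch_product_names(url=URL, parser="html.parser"):
    """Download the page and extract the names (needs network access)."""
    with urlopen(url) as response:
        return extract_product_names(response.read(), parser)
```

Calling it would then be something like: for i, name in enumerate(fetch_product_names()): print(i, name). Keeping the parsing separate from the download makes it easy to try the same HTML with a different parser argument.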

Hope this helps!

Have a nice day.

** UPDATE **

The differences between lxml and html.parser are well explained here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers

If the document is not perfectly-formed, different parsers will give different results. Here's a short, invalid document parsed using lxml's HTML parser. Note that the dangling </p> tag is simply ignored:

Result using lxml:

BeautifulSoup("<a></p>", "lxml")
<html><body><a></a></body></html>

Here is the same document parsed using html5lib:

Result using html5lib:

BeautifulSoup("<a></p>", "html5lib")
<html><head></head><body><a><p></p></a></body></html>

Result using html.parser:

BeautifulSoup("<a></p>", "html.parser")
<a></a>
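
To reproduce this comparison on your own machine, a small loop over the parser names works. This is a sketch in Python 3 syntax: render_with_parsers is a name I made up, and lxml and html5lib are optional third-party packages, so a parser that is not installed is reported as None instead of raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def render_with_parsers(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Map each parser name to the parsed tree as a string, or None if missing."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is not installed.
            results[name] = None
    return results


for name, tree in render_with_parsers("<a></p>").items():
    print(name, "->", tree)
```

Running the same loop on the downloaded page instead of "<a></p>" is a quick way to see which parser keeps all of the tbody blocks.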