BeautifulSoup inconsistent behavior

Question

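I am very puzzled by the behavior of the following HTML-scraping code, which I ran in two different environments, and I need help finding the root cause of the discrepancy.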

import sys
import bs4
import md5
import logging
from urllib2 import urlopen
from platform import platform

# Log particulars of the environment
logging.warning("OS platform is %s" %platform())
logging.warning("Python version is %s" %sys.version)
logging.warning("BeautifulSoup is at %s and its version is %s" %(bs4.__file__, bs4.__version__))

# Open web-page and read HTML
url = 'http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=JXIG&size=all'
response = urlopen(url)
html = response.read()

# Calculate MD5 to ensure that the same string was downloaded
print "MD5 sum for html string downloaded is %s" %md5.new(html).hexdigest()

# Make beautiful soup
soup = bs4.BeautifulSoup(html, 'html')
contigsTable = soup.find("table", {"class" : "zebra"})
contigs = []

# Parse table in soup to find all records
for row in contigsTable.findAll('tr'):
    column = row.findAll('td')
    if len(column) > 2:
        contigs.append(column[1])

# Expect identical results on any machine that this is run
print "Number of contigs identified is %s" %len(contigs)

On machine 1, this returns:

WARNING:root:OS platform is Linux-3.10.10-031010-generic-x86_64-with-Ubuntu-12.04-precise   
WARNING:root:Python version is 2.7.3 (default, Jun 22 2015, 19:33:41)  
[GCC 4.6.3]  
WARNING:root:BeautifulSoup is at /usr/local/lib/python2.7/dist-packages/bs4/__init__.pyc and its version is 4.3.2  
MD5 sum for html string downloaded is ca76b381df706a2d6443dd76c9d27adf  

Number of contigs identified is 630

On machine 2, this exact same code returns:

WARNING:root:OS platform is Linux-2.6.32-431.46.2.el6.nersc.x86_64-x86_64-with-debian-6.0.6
WARNING:root:Python version is 2.7.4 (default, Apr 17 2013, 10:26:13) 
[GCC 4.6.3]
WARNING:root:BeautifulSoup is at /global/homes/i/img/.local/lib/python2.7/site-packages/bs4/__init__.pyc and its version is 4.3.2
MD5 sum for html string downloaded is ca76b381df706a2d6443dd76c9d27adf

Number of contigs identified is 462

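The number of contigs counted is different. Note that the same code, parsing the same HTML table (the MD5 sums match), yields different results in two environments that are not strikingly different from each other, which has unfortunately turned into a production nightmare. Manual inspection confirms that the result returned on machine 2 is incorrect, but so far I have been unable to explain it.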

Has anyone had a similar experience? Do you notice anything wrong with this code, or should I stop trusting BeautifulSoup altogether?

Answer 1

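You are experiencing the differences between the parsers that BeautifulSoup chooses automatically for the "html" markup type you've specified. Which parser is picked depends on which modules are available in the current Python environment: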

If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser.

To get consistent behavior across all platforms, be explicit and name one parser:

soup = BeautifulSoup(html, "html.parser")
soup = BeautifulSoup(html, "html5lib")
soup = BeautifulSoup(html, "lxml")

See also: Installing a parser (in the Beautiful Soup documentation).
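To see why two machines might end up with different parsers, it can help to check which parser modules are importable on each one. The following is only a diagnostic sketch, not how Beautiful Soup itself makes the choice: it assumes Python 3 (the question uses Python 2.7), it takes the ranking (lxml, then html5lib, then the built-in parser) from the documentation quoted above, and the helper names `available_parsers` and `likely_default` are made up for illustration.

```python
import importlib.util

# Parser names accepted by BeautifulSoup, paired with the module each relies on.
# The order follows the ranking in the quoted documentation (an assumption of
# this sketch; it does not inspect Beautiful Soup's internals).
CANDIDATES = [
    ("lxml", "lxml"),
    ("html5lib", "html5lib"),
    ("html.parser", "html.parser"),  # standard library, always present
]


def available_parsers():
    """Return the parser names whose backing module can be imported here."""
    found = []
    for parser_name, module_name in CANDIDATES:
        if importlib.util.find_spec(module_name) is not None:
            found.append(parser_name)
    return found


def likely_default():
    """Guess which parser an unspecific request would probably resolve to."""
    return available_parsers()[0]


if __name__ == "__main__":
    print("Parsers importable in this environment:", available_parsers())
    print("Probable automatic choice:", likely_default())
```

Running this on both machines and comparing the output should show whether, for example, one has lxml installed and the other does not. Either way, passing an explicit parser name as shown above is what removes the dependence on the environment.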
