BeautifulSoup inconsistent behavior
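I am very confused by the behavior of the following HTML-scraping code, which I wrote and ran in two different environments, and I need help finding the root cause of the discrepancy.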
import sys
import bs4
import md5
import logging
from urllib2 import urlopen
from platform import platform
# Log particulars of the environment
logging.warning("OS platform is %s" %platform())
logging.warning("Python version is %s" %sys.version)
logging.warning("BeautifulSoup is at %s and its version is %s" %(bs4.__file__, bs4.__version__))
# Open web-page and read HTML
url = 'http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=JXIG&size=all'
response = urlopen(url)
html = response.read()
# Calculate MD5 to ensure that the same string was downloaded
print "MD5 sum for html string downloaded is %s" %md5.new(html).hexdigest()
# Make beautiful soup
soup = bs4.BeautifulSoup(html, 'html')
contigsTable = soup.find("table", {"class" : "zebra"})
contigs = []
# Parse table in soup to find all records
for row in contigsTable.findAll('tr'):
    column = row.findAll('td')
    if len(column) > 2:
        contigs.append(column[1])
# Expect identical results on any machine that this is run
print "Number of contigs identified is %s" %len(contigs)
On machine 1, running this returns:
WARNING:root:OS platform is Linux-3.10.10-031010-generic-x86_64-with-Ubuntu-12.04-precise
WARNING:root:Python version is 2.7.3 (default, Jun 22 2015, 19:33:41)
[GCC 4.6.3]
WARNING:root:BeautifulSoup is at /usr/local/lib/python2.7/dist-packages/bs4/__init__.pyc and its version is 4.3.2
MD5 sum for html string downloaded is ca76b381df706a2d6443dd76c9d27adf
Number of contigs identified is 630
On machine 2, this exact same code returns:
WARNING:root:OS platform is Linux-2.6.32-431.46.2.el6.nersc.x86_64-x86_64-with-debian-6.0.6
WARNING:root:Python version is 2.7.4 (default, Apr 17 2013, 10:26:13)
[GCC 4.6.3]
WARNING:root:BeautifulSoup is at /global/homes/i/img/.local/lib/python2.7/site-packages/bs4/__init__.pyc and its version is 4.3.2
MD5 sum for html string downloaded is ca76b381df706a2d6443dd76c9d27adf
Number of contigs identified is 462
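The number of contigs counted differs. Note that the same code, parsing the same HTML table, produces different results in two environments that are not obviously different from each other, which unfortunately led to this production nightmare. Manual inspection confirms that the result returned on machine 2 is incorrect, but so far I have been unable to explain it.
Has anyone had a similar experience? Do you notice anything wrong with this code, or should I stop trusting BeautifulSoup altogether?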
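You are seeing the differences between the parsers that BeautifulSoup chooses automatically for the "html" markup type you specified. Which parser is picked depends on which modules are available in the current Python environment: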
If you don’t specify anything, you’ll get the best HTML parser that’s
installed. Beautiful Soup ranks lxml’s parser as being the best, then
html5lib’s, then Python’s built-in parser.
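To see which parser the ranking quoted above would likely pick on a given machine, it can help to check which parser modules are importable there. The following is a minimal sketch written for Python 3 (the question's code is Python 2); the helper names `installed_parsers` and `likely_default_parser` are made up for illustration, and the preference order is simply the one stated in the quote, not read from BeautifulSoup itself.

```python
import importlib.util

# (parser name passed to BeautifulSoup, module that must be importable),
# listed in the preference order stated in the quoted documentation.
PREFERENCE = [
    ("lxml", "lxml"),
    ("html5lib", "html5lib"),
    ("html.parser", "html.parser"),  # ships with Python itself
]


def installed_parsers():
    """Return the parser names whose backing module can be found here."""
    found = []
    for parser_name, module_name in PREFERENCE:
        if importlib.util.find_spec(module_name) is not None:
            found.append(parser_name)
    return found


def likely_default_parser():
    """Return the first installed parser in the quoted preference order."""
    return installed_parsers()[0]


if __name__ == "__main__":
    print("Installed parsers:", installed_parsers())
    print("Likely automatic choice:", likely_default_parser())
```

Running this on both machines and comparing the output is one way to check whether they differ in the set of installed parsers.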
To get consistent behavior across platforms, be explicit about the parser:
# Pick one parser and use the same one on every machine
soup = BeautifulSoup(html, "html.parser")
# or
soup = BeautifulSoup(html, "html5lib")
# or
soup = BeautifulSoup(html, "lxml")
See also: Installing a parser.
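As a rough illustration of the advice above, the counting logic from the question can be wrapped in a function that takes the parser name as an argument, so the same parser is used everywhere and different parsers can be compared on the same input. This is a sketch for Python 3 with `bs4` installed; `count_contigs` and `SAMPLE` are hypothetical names, and the sample markup is a tiny stand-in, not the real NCBI page.

```python
from bs4 import BeautifulSoup


def count_contigs(html, parser):
    """Count rows in the table with class "zebra" that have more than two <td> cells."""
    soup = BeautifulSoup(html, parser)  # parser is named explicitly, never guessed
    table = soup.find("table", class_="zebra")
    if table is None:
        raise ValueError("no table with class 'zebra' found using %s" % parser)
    contigs = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) > 2:
            contigs.append(cells[1])
    return len(contigs)


# Small, well-formed stand-in for the real page (an assumption for illustration).
SAMPLE = (
    '<table class="zebra">'
    "<tr><th>#</th><th>Contig</th><th>Length</th></tr>"
    "<tr><td>1</td><td>JXIG01000001</td><td>100</td></tr>"
    "<tr><td>2</td><td>JXIG01000002</td><td>200</td></tr>"
    "</table>"
)

if __name__ == "__main__":
    print(count_contigs(SAMPLE, "html.parser"))
```

On the real page, calling the function once per installed parser and comparing the counts should show whether the 630 versus 462 discrepancy comes from the parser choice.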