beautifulsoup find_all not finding all

Question

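The following page is an example of the pages I am trying to collect information from: https://www.hockey-reference.com/boxscores/201610130TBL.html It is a bit hard to tell, but there are actually 8 tables, because the scoring summary and the penalty summary use the same class name as the other tables.

I am trying to access the tables with the following code, slightly modified to try to troubleshoot the problem.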

import os
from bs4 import BeautifulSoup # imports BeautifulSoup

file = open("Detroit_vs_Tampa.txt")
data = file.read()
file.close()

soup = BeautifulSoup(data,'lxml')
get_table = soup.find_all(class_="overthrow table_container")

print(len(get_table))

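The output of this code is 6, which is clearly wrong. Digging further, I found that the tables it misses are the two below the Advanced Stats Report header.

I would also like to point out that, because I thought the parser might be the problem, I tried both html.parser and lxml directly against the website (as opposed to the text file used in the sample code), so I do not think the HTML is corrupted.

I had a friend take a quick look, thinking it might be a small oversight on my part, and he noticed that the site uses an old IE hack and wraps the tables in comment tags.

I am not 100% sure that this is why it is not working, but I have googled the problem and found nothing. I am hoping someone here can point me in the right direction.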

Answer 1

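The last tables are loaded by JS, but as you noticed, they are also embedded in the static HTML inside comment tags. You can get them with bs4 if you search for Comment objects.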

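import requests
from bs4 import BeautifulSoup, Comment

url = 'https://www.hockey-reference.com/boxscores/201610130TBL.html'
data = requests.get(url).text
soup = BeautifulSoup(data, 'lxml')
# tables that are present in the regular markup
get_table = soup.find_all(class_="overthrow table_container")
# first comment whose text contains the remaining table markup
# (string= is the current name for the older text= argument)
comment = soup.find(string=lambda text: isinstance(text, Comment) and 'table_container' in text)
# parse the comment body as HTML and collect the tables inside it
get_table += BeautifulSoup(comment.string, 'lxml').find_all(class_="overthrow table_container")
print(len(get_table))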
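
If the missing tables turn out to be spread over more than one comment, the same idea can be applied to every matching comment instead of only the first. Below is a minimal sketch of that, not a tested solution for the real page: `SAMPLE_HTML` is a made-up miniature of the page structure, `find_table_containers` is a hypothetical helper name, and `html.parser` is used only so the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup, Comment

# Made-up stand-in for the real page: one container in the normal markup
# and two more that exist only inside an HTML comment.
SAMPLE_HTML = """
<html><body>
<div class="overthrow table_container" id="visible"><table></table></div>
<!--
<div class="overthrow table_container" id="hidden_1"><table></table></div>
<div class="overthrow table_container" id="hidden_2"><table></table></div>
-->
<!-- an unrelated comment -->
</body></html>
"""

TABLE_CLASS = "overthrow table_container"


def find_table_containers(html, parser="html.parser"):
    """Collect table containers from the markup and from any comments."""
    soup = BeautifulSoup(html, parser)
    # Containers that are part of the parsed tree.
    containers = list(soup.find_all(class_=TABLE_CLASS))
    # Comments are plain strings to the parser, so re-parse each one
    # that looks like it holds table markup.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        if "table_container" in comment:
            inner = BeautifulSoup(str(comment), parser)
            containers.extend(inner.find_all(class_=TABLE_CLASS))
    return containers


tables = find_table_containers(SAMPLE_HTML)
print(len(tables))
print([t.get("id") for t in tables])
```

Note that the containers found inside comments belong to separate soup objects, so navigating from them back to the rest of the page (for example to the header above the table) will not work.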

Or you can use selenium, but it is much heavier than urllib or requests.

from selenium import webdriver
from bs4 import BeautifulSoup 

url = 'https://www.hockey-reference.com/boxscores/201610130TBL.html'
driver = webdriver.Firefox()
driver.get(url)
data = driver.page_source
driver.quit()

soup = BeautifulSoup(data,'lxml')
get_table = soup.find_all(class_="overthrow table_container")
print(len(get_table))
