Can’t find a tag that I know is in the document - find_all() returns []
I'm using bs4 to scrape a user profile on Khan Academy, https://www.khanacademy.org/profile/DFletcher1990/.
I'm trying to get the user statistics (date joined, energy points earned, videos completed).
I have checked https://www.crummy.com/software/BeautifulSoup/bs4/doc/
It seems relevant: "The most common type of unexpected behavior is that you can’t find a tag that you know is in the document. You saw it going in, but find_all() returns [] or find() returns None. This is another common problem with Python’s built-in HTML parser, which sometimes skips tags it doesn’t understand. Again, the solution is to install lxml or html5lib."
I have tried different parsers, but I run into the same problem.
from bs4 import BeautifulSoup
import requests

url = 'https://www.khanacademy.org/profile/DFletcher1990/'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
print(soup.find_all('div', class_='profile-widget-section'))
My code is returning [].
The page content is loaded with JavaScript. The easiest way to check whether content is dynamic is to right-click, view the page source, and check whether the content is present there. You can also turn off JavaScript in your browser and go to the URL.
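The same check can be scripted. A minimal sketch (the helper name is made up for illustration, and the commented-out fetch assumes requests is installed):

```python
def marker_in_static_html(html: str, marker: str) -> bool:
    """True if `marker` appears in the raw HTML string (no JavaScript has run)."""
    return marker in html

# With requests, fetch the un-rendered page source and test for the class
# your selector targets. False means the element is built by JavaScript:
# raw = requests.get('https://www.khanacademy.org/profile/DFletcher1990/').text
# marker_in_static_html(raw, 'profile-widget-section')
```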
You can use selenium to get the content:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome(executable_path='/path/to/chromedriver')
driver.get("https://www.khanacademy.org/profile/DFletcher1990/")

# Wait until JavaScript has rendered the statistics table.
element = WebDriverWait(driver, 20).until(EC.presence_of_element_located(
    (By.XPATH, '//*[@id="widget-list"]/div[1]/div[1]/div[2]/div/div[2]/table')))
source = driver.page_source
driver.quit()

soup = BeautifulSoup(source, 'html.parser')
user_info_table = soup.find('table', class_='user-statistics-table')
for tr in user_info_table.find_all('tr'):
    tds = tr.find_all('td')
    print(tds[0].text, ":", tds[1].text)
Output:
Date joined : 4 years ago
Energy points earned : 932,915
Videos completed : 372
Another available option (since you are already familiar with requests) is to use requests-html:
from bs4 import BeautifulSoup
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.khanacademy.org/profile/DFletcher1990/')
r.html.render(sleep=10)  # run the page's JavaScript before parsing

soup = BeautifulSoup(r.html.html, 'html.parser')
user_info_table = soup.find('table', class_='user-statistics-table')
for tr in user_info_table.find_all('tr'):
    tds = tr.find_all('td')
    print(tds[0].text, ":", tds[1].text)
Output:
Date joined : 4 years ago
Energy points earned : 932,915
Videos completed : 372
Another option is to find out which ajax requests are being made, mimic them, and parse the response. That response is not always JSON. In this case, though, the content is not sent to the browser in an ajax response: it is already present in the page source, inside a script tag.
The page only uses JavaScript to build the widgets from that data. You could try to get the data out of that script tag, which will probably involve some regex, and then build JSON from the string.
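A minimal sketch of that last approach, with a made-up inline-state variable name and sample values standing in for the real page source (the actual script tag on the page would need inspecting first):

```python
import json
import re

# Hypothetical page source: the stats are embedded in an inline <script> tag.
page_source = '''
<html><body>
<script>window.__initial_state__ = {"dateJoined": "4 years ago",
  "energyPoints": 932915, "videosCompleted": 372};</script>
</body></html>
'''

# Capture the object literal assigned to the (assumed) state variable, then
# turn the matched string into a Python dict with json.loads.
match = re.search(r'window\.__initial_state__\s*=\s*(\{.*?\});', page_source, re.S)
state = json.loads(match.group(1)) if match else {}
print(state.get('energyPoints'))  # 932915
```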