Can't scrape <h3> tag from page

Question

It seems I can scrape any tag and class except the h3 on this page. It keeps returning None or an empty list. I'm trying to get this h3 tag:

...on the following web page:

https://www.empireonline.com/movies/features/best-movies-2/

Here is the code I'm using:

import requests
from pprint import pprint
from bs4 import BeautifulSoup

URL = "https://www.empireonline.com/movies/features/best-movies-2/"

response = requests.get(URL)
web_html = response.text

soup = BeautifulSoup(web_html, "html.parser")

movies = soup.find_all(name="h3", class_="jsx-4245974604")

movies_text=[]

for item in movies:
    result = item.getText()
    movies_text.append(result)

print(movies_text)

Can you help me solve this?

Answer 1

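As others have mentioned, this is dynamic content: it is generated by JavaScript only when the page is opened and run in a browser. requests downloads just the initial HTML and never runs that JavaScript, which is why you can't find the class "jsx-4245974604" with BS4.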
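A minimal sketch of why this happens, using a made-up HTML snippet rather than the real page (the STATIC_HTML string and its script are assumptions for illustration only). The h3 is created by a script, and BeautifulSoup only parses text, so the element never appears in the tree:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML a server sends before any JavaScript runs.
STATIC_HTML = """
<html>
  <body>
    <div id="app"></div>
    <script>
      // In a browser, this script would create the h3 element.
      const h3 = document.createElement("h3");
      h3.className = "jsx-4245974604";
      h3.textContent = "Example Movie";
      document.getElementById("app").appendChild(h3);
    </script>
  </body>
</html>
"""

soup = BeautifulSoup(STATIC_HTML, "html.parser")

# The parser never executes the script, so no h3 element exists in the tree.
headings = soup.find_all("h3", class_="jsx-4245974604")
first_heading = soup.find("h3")

print(headings)       # []
print(first_heading)  # None
```

This matches the two symptoms in the question: find_all gives an empty list and find gives None.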

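If you print the soup variable, you will see that the class really isn't in there. But if you just want the movie names, you can use another part of the HTML in this case.

The movie names are in the alt attribute of the images (and in many other parts of the HTML as well).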

import requests
from bs4 import BeautifulSoup

URL = "https://www.empireonline.com/movies/features/best-movies-2/"

# Download the initial HTML; no JavaScript is executed at this point.
response = requests.get(URL)
web_html = response.text

soup = BeautifulSoup(web_html, "html.parser")

# The movie names are stored in the alt attribute of each image.
movies = soup.find_all("img", class_="jsx-952983560")

movies_text = []

for item in movies:
    result = item.get("alt")
    movies_text.append(result)

print(movies_text)
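Class names like "jsx-952983560" look auto-generated, so they may change when the site is rebuilt. As a hedged alternative, the sketch below selects images by the presence of an alt attribute instead of by class. The SAMPLE_HTML markup and the image_alt_texts helper are hypothetical, and on the real page this broader filter could also pick up images that are not movies:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page's structure may differ.
SAMPLE_HTML = """
<div>
  <img class="jsx-952983560" alt="Movie A" src="a.jpg">
  <img class="jsx-111" alt="Movie B" src="b.jpg">
  <img class="logo" src="logo.png">
  <img class="spacer" alt="" src="s.gif">
</div>
"""


def image_alt_texts(html):
    """Return the non-empty alt text of every <img> in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # alt=True keeps only <img> tags that carry an alt attribute at all.
    images = soup.find_all("img", alt=True)
    # Skip images whose alt text is empty or whitespace only.
    return [img["alt"].strip() for img in images if img["alt"].strip()]


titles = image_alt_texts(SAMPLE_HTML)
print(titles)  # ['Movie A', 'Movie B']
```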

If you run into this problem again, remember to print the initial HTML that soup gives you and check by eye whether the information you need is actually in it.
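Checking by eye can be tedious on a large page. One possible shortcut, sketched here with only the standard library, is to count which tags occur in the downloaded HTML. The TagCounter class, the count_tags helper, and the SAMPLE string are illustrative names and data, not part of the original answer:

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how often each start tag appears in an HTML string."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_tags(html):
    """Return a Counter mapping tag names to their number of occurrences."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Hypothetical sample; for a real page, pass response.text instead.
SAMPLE = '<div><img alt="Movie A" src="a.jpg"><img alt="Movie B" src="b.jpg"><p>Intro</p></div>'

counts = count_tags(SAMPLE)
print(counts["img"])  # 2
print(counts["h3"])   # 0
```

A count of 0 for the tag you want suggests it is added later by JavaScript and is not in the HTML that requests downloaded.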
