Beautiful Soup returns an empty list when scraping a YouTube channel

I am trying to get some public information about a YouTube channel using this code (the API is not suitable for this task).

Code example:

import re
import json
import requests
from bs4 import BeautifulSoup

URL = "https://www.youtube.com/c/Rozziofficial/about"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")

# We locate the JSON data using a regular-expression pattern
data = re.search(r"var ytInitialData = ({.*});", str(soup)).group(1)

# Uncomment to view all the data
# print(json.dumps(data))

# This converts the JSON data to a python dictionary (dict)
json_data = json.loads(data)

# This is the info from the webpage on the right-side under "stats", it contains the data you want
stats = json_data["contents"]["twoColumnBrowseResultsRenderer"]["tabs"][5]["tabRenderer"]["content"]["sectionListRenderer"]["contents"][0]["itemSectionRenderer"]["contents"][0]["channelAboutFullMetadataRenderer"]

print("Channel Views:", stats["viewCountText"]["simpleText"])
print("Joined:", stats["joinedDateText"]["runs"][1]["text"])

Expected result (this worked fine 6 months ago):

Joined: Jun 30, 2007

...but now I get:

AttributeError: 'NoneType' object has no attribute 'group'

The traceback shows that the error is on this line:

data = re.search(r"var ytInitialData = ({.*});", str(soup)).group(1)

Can you help fix this so that the code works again and returns the data?

Any help is appreciated, thanks.

You aren't actually using BeautifulSoup here at all. You are just fetching the raw text and searching it for a string.
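
To illustrate that point, here is a minimal sketch that applies the same regular expression directly to the page text, with no BeautifulSoup involved. The helper name `extract_initial_data` and the stand-in `sample_html` are hypothetical, and the sketch assumes the page still embeds a `var ytInitialData = {...};` assignment, which may no longer be true.

```python
import json
import re
from typing import Optional


def extract_initial_data(html: str) -> Optional[dict]:
    """Return the embedded ytInitialData object, or None if it is absent.

    Hypothetical helper: uses the same pattern as the question, but on the
    raw text instead of on str(soup).
    """
    match = re.search(r"var ytInitialData = ({.*});", html)
    if match is None:
        # re.search returns None when the pattern is not found; checking
        # here avoids the AttributeError raised by calling .group on None.
        return None
    return json.loads(match.group(1))


# Stand-in page content (an assumption, not real YouTube markup).
sample_html = '<script>var ytInitialData = {"contents": {"views": "123"}};</script>'
print(extract_initial_data(sample_html))
```

With a live page you would pass `requests.get(URL).text` instead of `sample_html`. A `None` result then means the variable was not found in the response.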

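This is the problem with web scraping. YouTube has changed its JavaScript, and that variable no longer exists. We don't know what you are looking for, but your current approach won't work. You may actually need to use Selenium to run the JavaScript and extract the information from the DOM.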
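
Because the page's internals can change without notice, hard-coded positions such as `["tabs"][5]` are especially fragile. One way to lean on them less, sketched here as an assumption rather than a confirmed fix, is to search the parsed data for a key by name. `find_key` is a hypothetical helper and `sample` is made-up stand-in data. Whether `channelAboutFullMetadataRenderer` still appears in current pages is not something this sketch can guarantee.

```python
from typing import Any, Iterator


def find_key(node: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key` anywhere in nested dicts/lists."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            # Keep descending so deeper occurrences are found as well.
            yield from find_key(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


# Made-up stand-in for the parsed page data (not real YouTube output).
sample = {
    "contents": {
        "tabs": [
            {"tabRenderer": {}},
            {
                "tabRenderer": {
                    "content": {
                        "channelAboutFullMetadataRenderer": {
                            "viewCountText": {"simpleText": "1,000 views"}
                        }
                    }
                }
            },
        ]
    }
}

# Take the first match, or None when the key is missing entirely.
stats = next(find_key(sample, "channelAboutFullMetadataRenderer"), None)
print(stats["viewCountText"]["simpleText"] if stats else "key not found")
```

The lookup no longer depends on which tab index holds the "About" data, but it still fails if the key itself is renamed or removed.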

Your code works fine.

import re
import json
import requests
from bs4 import BeautifulSoup

URL = "https://www.youtube.com/c/Rozziofficial/about"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")

# We locate the JSON data using a regular-expression pattern
data = re.search(r"var ytInitialData = ({.*});", str(soup)).group(1)

# Uncomment to view all the data
# print(json.dumps(data))

# This converts the JSON data to a python dictionary (dict)
json_data = json.loads(data)

# This is the info from the webpage on the right-side under "stats", it contains the data you want
stats = json_data["contents"]["twoColumnBrowseResultsRenderer"]["tabs"][5]["tabRenderer"]["content"]["sectionListRenderer"]["contents"][0]["itemSectionRenderer"]["contents"][0]["channelAboutFullMetadataRenderer"]

print("Channel Views:", stats["viewCountText"]["simpleText"])
print("Joined:", stats["joinedDateText"]["runs"][1]["text"])

Output:

Channel Views: 1,12,94,125টি ভিউ
Joined: 30 জুন, 2007