Scraping a webpage using Python (Beautiful Soup) that requires an "I agree to cookies" button to be clicked?

Question

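I am trying to scrape the following URL for all of the day's football (soccer) matches: https://www.soccerstats.com/matches.asp?matchday=2&daym=tomorrow

My code used to work, but the site has changed and you now have to click an "I agree to cookies" button before the site will load the page. This is now breaking my code. Is there a way around it?

Any help is much appreciated.

I have tried looking at the text output from bs4, and it is clear that the site has not loaded: the "I agree to cookies" text is visible in the output, which means the request never gets past this stage.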

import re  # needed for re.findall below

import requests
from bs4 import BeautifulSoup

url = "https://www.soccerstats.com/matches.asp?matchday=2"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')

# Collect the href of every match button from the raw HTML
all_matches = re.findall(r"""<a class='button' style='background-color:#AAAAAA;font-color=white;' href='(.*?)'>""", data)

The output should list the URLs of the matches.

Answer 1

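When you click "I agree to cookies", the site sends your browser a cookie that basically tells the site "This user has agreed to cookies." You can capture this cookie in a tool such as Chrome DevTools by opening the Application tab, clicking "Cookies" on the left, and then selecting the site you are on.

Once that's done, click "I agree to cookies" and see which cookies were added to your browser. On the site I looked at, one of the added cookies was named __hs_opt_out, with the value no. You can then simply add that cookie to your request: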

r = requests.get(url, cookies={'__hs_opt_out': 'no'})

Or, even better:

s = requests.Session()
s.cookies.update({'__hs_opt_out': 'no'})
s.get(url)  # Automatically uses the session cookies

# Some more code...

s.get(other_url)  # Remembers the cookie from before
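
If it helps to confirm that the cookie is actually attached before scraping, here is a minimal sketch under a few assumptions. The helper names (build_session, cookie_header_for, consent_wall_present, fetch) are made up for illustration and are not part of requests. The cookie name and value are the ones reported above and may differ on the site you are scraping. The consent check simply looks for the "I agree to cookies" text that the question saw in the blocked page.

```python
import requests

URL = "https://www.soccerstats.com/matches.asp?matchday=2"
# Cookie name/value as reported in this answer; an assumption for any other site.
CONSENT_COOKIE = {"__hs_opt_out": "no"}
# Text the question reported seeing when the consent page was served instead.
CONSENT_MARKER = "I agree to cookies"


def build_session(cookies):
    """Create a session that carries the given cookies on every request."""
    session = requests.Session()
    session.cookies.update(cookies)
    return session


def cookie_header_for(session, url):
    """Return the Cookie header the session would send, without any network I/O."""
    prepared = session.prepare_request(requests.Request("GET", url))
    return prepared.headers.get("Cookie", "")


def consent_wall_present(html):
    """Heuristic: True if the consent prompt text still appears in the HTML."""
    return CONSENT_MARKER.lower() in html.lower()


def fetch(session, url):
    """Download the page and fail loudly if the consent prompt is still there."""
    response = session.get(url, timeout=10)
    response.raise_for_status()
    if consent_wall_present(response.text):
        raise RuntimeError("Consent prompt still present; check the cookie name and value")
    return response.text


session = build_session(CONSENT_COOKIE)
print(cookie_header_for(session, URL))  # shows what would be sent; no request is made
```

Calling fetch(session, URL) then returns the page HTML, which can be passed to BeautifulSoup as in the question. If it raises, the site probably uses a different consent cookie, and the DevTools steps above are the way to find it.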

