Getting HTML with Python requests and redirection

Question

I want to scrape the page at url = 'https://e-justice.europa.eu/searchBris.do' and submit my own information to it. I use requests.get(url) to fetch the page's HTML content.

requests.get(url)

But the response I get back is a redirect page, as shown below:

\n\n\n\n\n\n\n<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">\n<html lang="en">\n    <head>\n    <title>Find a company</title>\n    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n\n    <script> \n        top.location.reload();\n    </script>\n\n    <noscript><meta http-equiv="refresh" content="0;url=https://e-justice.europa.eu/searchBris.do"/></noscript>\n    </head>\n    <body>\n        <h1>Redirecting...</h1>\n    </body>\n</html>

I also tried the allow_redirects option, with both session.get() and session.post(), as shown below, but I still get the redirect page and access to the URL is denied.

import requests

session = requests.Session()  # the session object used by the calls below
requests.get(url, allow_redirects=True)
session.get(url, allow_redirects=True)
requests.post(url, allow_redirects=True)
session.post(url, allow_redirects=True)

Is there any way to get the content of the original URL?

Answer 1

Despite what it claims, the page is not using a conventional HTTP redirect, which you can check:

import requests

url = 'https://e-justice.europa.eu/searchBris.do'
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})  # spoof UA just in case
r.is_redirect
> False

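What is actually happening is in the page markup: the <script> reloads the page, and the <noscript> tag carries a meta refresh as a fallback. The site is rendered with client-side JavaScript, so you can't do this with a plain HTML scraper (one without a browser).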
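To make the point concrete, here is a minimal sketch that looks for a meta refresh in an HTML string using only the standard library. The names RefreshFinder and find_meta_refresh are hypothetical helpers, not part of requests, and the sample markup is a shortened copy of the response quoted in the question.

```python
from html.parser import HTMLParser


class RefreshFinder(HTMLParser):
    """Collects the target URLs of <meta http-equiv="refresh"> tags."""

    def __init__(self):
        super().__init__()
        self.refresh_targets = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute looks like "0;url=https://example.org/".
        content = attrs.get("content") or ""
        _, _, target = content.partition(";")
        target = target.strip()
        if target.lower().startswith("url="):
            self.refresh_targets.append(target[len("url="):])


def find_meta_refresh(html):
    """Return the list of URLs that the document asks the browser to load."""
    finder = RefreshFinder()
    finder.feed(html)
    finder.close()
    return finder.refresh_targets


# Shortened copy of the response body from the question.
sample = (
    '<html lang="en"><head><title>Find a company</title>'
    "<script>top.location.reload();</script>"
    '<noscript><meta http-equiv="refresh" '
    'content="0;url=https://e-justice.europa.eu/searchBris.do"/></noscript>'
    "</head><body><h1>Redirecting...</h1></body></html>"
)
print(find_meta_refresh(sample))
```

If this returns a non-empty list while r.is_redirect is False, the "redirect" is being requested by the document itself rather than by an HTTP 3xx status, which is why allow_redirects has no effect on it.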

You could try a headless browser driven by Selenium.
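A rough sketch of that approach follows, assuming the third-party selenium package (version 4 or later) and a local Chrome installation are available. The function names make_headless_chrome and fetch_rendered_html are hypothetical, and whether the site actually serves its content to a headless browser is not something this sketch can guarantee.

```python
def make_headless_chrome():
    """Create a headless Chrome driver (assumes selenium and Chrome are installed)."""
    # Imported here so the rest of the module can be used without selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")  # run Chrome without a visible window
    return webdriver.Chrome(options=options)


def fetch_rendered_html(url, driver_factory=make_headless_chrome):
    """Load url in a browser and return the HTML after scripts have run."""
    driver = driver_factory()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always shut the browser down, even if loading the page fails.
        driver.quit()
```

Calling fetch_rendered_html(url) with the URL from the question would return the page source as the browser sees it once the page has loaded. Passing the driver in through driver_factory is a design choice of this sketch: it makes it easy to swap in Firefox or another driver that exposes get, page_source and quit.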

Answer 2

I tried using PhantomJS to fetch this site's HTML, and it worked.

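from selenium import webdriver
# webdriver.PhantomJS is only available in older Selenium releases (it was removed in Selenium 4)
driver = webdriver.PhantomJS()
driver.get(url)
html = str(driver.page_source)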

beautifulsoup

web-crawler

request

python-3.x