BeautifulSoup not returning the title of page

Question

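I am trying to get the title of a webpage by web scraping with the Beautifulsoup4 Python module, but it returns the string "Not Acceptable!" as the title, whereas the title is different when I open the page in a browser. I also tried looping through a list of links and extracting the title of every page, but it returns the same string "Not Acceptable!" for all of the links.

Here is the Python code: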

from bs4 import BeautifulSoup
import requests


URL = 'https://insights.blackcoffer.com/how-is-login-logout-time-tracking-for-employees-in-office-done-by-ai/'
result = requests.get(URL)
doc = BeautifulSoup(result.text, 'html.parser')
tag = doc.title
print(tag.get_text())

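Here is the link to the corresponding webpage: webpage link

I don't know whether the problem is with Beautifulsoup4 or with the requests library. Could it be that the site has bot protection enabled and does not return the real HTML when the request is sent?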

Answer 1

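An easy way to debug this kind of problem is to print (or write to a file) result.text. The reason is that some servers do not allow scraping, and some websites (YouTube, for example) generate their HTML at runtime with JavaScript. These are some of the scenarios in which result.text can differ from the HTML source we see in the browser. In this case the server returned the following text: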

<head><title>Not Acceptable!</title></head><body><h1>Not Acceptable!</h1><p>An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.</p></body></html>
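To make that debugging step a bit more systematic, here is a minimal sketch that looks at the status code and the start of the body before any parsing happens. The helper names `summarize_response` and `fetch_summary`, the 200-character preview, and the 10-second timeout are my own choices for illustration, not anything from the question.

```python
import requests


def summarize_response(response, preview_chars=200):
    """Collect the parts of a response that are useful when debugging a scrape."""
    return {
        "status": response.status_code,  # e.g. 200 for success, 4xx for client errors
        "ok": response.ok,               # False for any 4xx/5xx status
        "content_type": response.headers.get("Content-Type", ""),
        "preview": response.text[:preview_chars],  # first part of the raw body
    }


def fetch_summary(url):
    """Hypothetical helper: GET the URL and summarize what the server sent back."""
    return summarize_response(requests.get(url, timeout=10))
```

Calling `print(fetch_summary(URL))` with the URL from the question should make it obvious whether the server sent the real page or an error page, without having to read the whole HTML dump.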

Edit: As DYZ pointed out, this is a 406 error, and the User-Agent header is missing from the request.

https://www.exai.com/blog/406-not-acceptable

The 406 Not Acceptable status code is a client-side error. It's part of the HTTP response status codes in the 4xx category, which are considered client error responses

Answer 2

The server requires a User-Agent header. Interestingly, it is happy with any User-Agent, even a made-up one:

result = requests.get(URL, headers = {'User-Agent': 'My User Agent 1.0'})
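Putting that together with the code from the question, a rough sketch could look like the following. The function names `extract_title` and `fetch_title` and the timeout value are assumptions of mine, and I am assuming the server keeps accepting an arbitrary User-Agent as described above. `raise_for_status()` is there so that an error such as 406 raises an exception instead of being parsed as if it were the page.

```python
from bs4 import BeautifulSoup
import requests

# Assumption: any non-empty User-Agent satisfies this server, as noted above.
HEADERS = {"User-Agent": "My User Agent 1.0"}


def extract_title(html):
    """Return the text of the first <title> tag, or None if there is none."""
    doc = BeautifulSoup(html, "html.parser")
    return doc.title.get_text(strip=True) if doc.title else None


def fetch_title(url, headers=HEADERS, timeout=10):
    """Request the page with a User-Agent header and return its title."""
    result = requests.get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses (406 included),
    # so an error page is never mistaken for the real content.
    result.raise_for_status()
    return extract_title(result.text)
```

For a list of links, something like `for url in links: print(fetch_title(url))` should then print each page's own title, where `links` stands for whatever list you are looping over.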

python

web-scraping

python-3.x

python-requests