Getting error while web scraping the link

I get an error while scraping the given link. Can anyone help me fix the error, and share code that scrapes the page at the link to get all the text data?

from urllib.request import Request, urlopen
link='https://novelfull.com/warriors-promise/chapter-1.html'
req = Request(link) 
webpage = urlopen(req).read()

You can try using requests:

>>> import requests
>>> res = requests.get("https://novelfull.com/warriors-promise/chapter-1.html")
>>> res.raise_for_status()
>>> res.text
'\r\n<!DOCTYPE html><html lang="en-US"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>...'

To get at the page's content (in this case, the actual story), you will probably want an HTML parsing library such as BeautifulSoup4 or lxml.

BeautifulSoup4

import bs4
import requests

res = requests.get("https://novelfull.com/warriors-promise/chapter-1.html")
soup = bs4.BeautifulSoup(res.text, features="html.parser")
elem = soup.select("#chapter-content div:nth-child(3) div")[0]
content = elem.getText()

BeautifulSoup4 is a third-party module, so be sure to install it: pip install BeautifulSoup4.
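If the goal is all of the text under an element rather than one specific child div, BeautifulSoup's get_text() walks every text node beneath it. A minimal offline sketch (the HTML snippet here is a hypothetical stand-in for the page's structure, since fetching the real page needs the header trick shown further down):

```python
import bs4

# Hypothetical stand-in HTML mirroring the page's structure.
html = """
<div id="chapter-content">
  <h2>Chapter 1</h2>
  <p>First paragraph of the story.</p>
  <p>Second paragraph of the story.</p>
</div>
"""

soup = bs4.BeautifulSoup(html, features="html.parser")
chapter = soup.select_one("#chapter-content")
# get_text() concatenates every text node under the element;
# separator/strip keep each piece of text on its own line.
text = chapter.get_text(separator="\n", strip=True)
print(text)
```

select_one() is a shorthand for select(...)[0] that returns None instead of raising when nothing matches.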

lxml

from urllib.request import urlopen
from lxml import etree

res = urlopen("https://novelfull.com/warriors-promise/chapter-1.html")
htmlparser = etree.HTMLParser()
tree = etree.parse(res, htmlparser)
# xpath() returns a list of matching elements, so take the first match
elem = tree.xpath("//div[@id='chapter-content']//div[3]//div")[0]
content = elem.text

lxml is a third-party module, so be sure to install it: pip install lxml
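Note that with lxml, an element's .text attribute only holds the text before its first child tag; to gather all text nodes under an element, use itertext(). A small offline sketch (the HTML snippet is a hypothetical stand-in for the real page):

```python
from lxml import etree

# Hypothetical stand-in HTML mirroring the page's structure.
html = """
<div id="chapter-content">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
"""

tree = etree.fromstring(html, etree.HTMLParser())
elem = tree.xpath("//div[@id='chapter-content']")[0]
# .text would stop at the first <p>; itertext() walks
# every text node under the element.
content = "".join(elem.itertext())
```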

Setting a User-Agent in the headers, so the request looks like it comes from a browser, seems to avoid the HTTP 403: Forbidden error, e.g.:

from urllib.request import Request, urlopen
link='https://novelfull.com/warriors-promise/chapter-1.html'
req = Request(link, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})
webpage = urlopen(req).read()

You can also check this question for a similar case.