Getting error while web scraping the link
I'm getting an error while scraping the given link. Can anyone help me fix the error and share code to scrape the page for all of its text data?
from urllib.request import Request, urlopen
link='https://novelfull.com/warriors-promise/chapter-1.html'
req = Request(link)
webpage = urlopen(req).read()
You can try using requests:
>>> import requests
>>> res = requests.get("https://novelfull.com/warriors-promise/chapter-1.html")
>>> res.raise_for_status()
>>> res.text
'\r\n<!DOCTYPE html><html lang="en-US"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>...'
To extract the page's content (in this case the actual story), you will probably want an HTML parsing library such as BeautifulSoup4 or lxml.
BeautifulSoup4
import bs4
import requests

res = requests.get("https://novelfull.com/warriors-promise/chapter-1.html")
res.raise_for_status()  # fail fast on HTTP errors such as 403/404
soup = bs4.BeautifulSoup(res.text, features="html.parser")
# Select the <div> that holds the chapter text, then read its plain text
elem = soup.select("#chapter-content div:nth-child(3) div")[0]
content = elem.getText()
BeautifulSoup4 is a third-party module, so be sure to install it first: pip install BeautifulSoup4.
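To see what the selector actually returns without hitting the network, here is an offline sketch of the same select()/getText() calls on an inline snippet (the snippet's structure is made up for illustration and may not match the real page):

```python
import bs4

# Inline stand-in for the real page, for illustration only
html = """
<div id="chapter-content">
  <div>ads</div>
  <div>nav</div>
  <div><div>Chapter 1: The story text goes here.</div></div>
</div>
"""

soup = bs4.BeautifulSoup(html, features="html.parser")
# div:nth-child(3) matches the third child div; the trailing "div"
# then descends into it
elem = soup.select("#chapter-content div:nth-child(3) div")[0]
content = elem.getText()
print(content)  # Chapter 1: The story text goes here.
```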
lxml
from urllib.request import urlopen
from lxml import etree

res = urlopen("https://novelfull.com/warriors-promise/chapter-1.html")
htmlparser = etree.HTMLParser()  # note the capital "P"
tree = etree.parse(res, htmlparser)
# xpath() returns a list of matching elements, so index into it
elems = tree.xpath("//div[@id='chapter-content']//div[3]//div")
content = elems[0].text
lxml is a third-party module, so be sure to install it first: pip install lxml
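The same XPath idea can be tried offline on an inline snippet (again, the structure here is made up for illustration); note that xpath() gives you a list, not a single element:

```python
from lxml import etree

# Inline stand-in for the real page, for illustration only
html = """
<div id="chapter-content">
  <div>ads</div>
  <div>nav</div>
  <div><div>Chapter 1: The story text goes here.</div></div>
</div>
"""

# fromstring() with an HTMLParser tolerates fragment/loose HTML
tree = etree.fromstring(html, etree.HTMLParser())
# div[3] picks the third child div, the final /div descends into it
elems = tree.xpath("//div[@id='chapter-content']/div[3]/div")
content = elems[0].text
print(content)  # Chapter 1: The story text goes here.
```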
Setting a User-Agent header so the request looks like it comes from a browser seems to avoid the HTTP 403: Forbidden error, for example:
from urllib.request import Request, urlopen
link='https://novelfull.com/warriors-promise/chapter-1.html'
req = Request(link, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})
webpage = urlopen(req).read()
You can also check this question for a similar case.
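If you want to avoid third-party packages entirely, the standard library's html.parser can also collect every text node on a page. A minimal sketch (the class name and snippet are hypothetical, for illustration):

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects all non-whitespace text nodes from an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags
        text = data.strip()
        if text:
            self.chunks.append(text)

parser = TextCollector()
parser.feed("<html><body><h1>Title</h1><p>First line.</p></body></html>")
print(parser.chunks)  # ['Title', 'First line.']
```

This gives you raw text without any scripting or CSS-selector support, so for anything beyond "grab all the text" the BeautifulSoup4 or lxml approaches above are easier to work with.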