How to get raw HTML with absolute link paths when using 'requests-html'
When I make a request to https://whosebug.com using the requests library:
import requests

page = requests.get(url='https://whosebug.com')
print(page.content)
I get the following:
<!DOCTYPE html>
<html class="html__responsive html__unpinned-leftnav">
<head>
<title>Stack Overflow - Where Developers Learn, Share, & Build Careers</title>
<link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/Whosebug/Img/favicon.ico?v=ec617d715196">
<link rel="apple-touch-icon" href="https://cdn.sstatic.net/Sites/Whosebug/Img/apple-touch-icon.png?v=c78bd457575a">
<link rel="image_src" href="https://cdn.sstatic.net/Sites/Whosebug/Img/apple-touch-icon.png?v=c78bd457575a">
..........
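The source here uses absolute paths. But when I request the same URL using requests-html with JS rendering:
from requests_html import HTMLSession

with HTMLSession() as session:
    page = session.get('https://whosebug.com')
    page.html.render()
    print(page.content)
I get the following: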
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>Whosebug.org</title>
<script type="text/javascript" src="lib/jquery.js"></script>
<script type="text/javascript" src="lib/interface.js"></script>
<script type="text/javascript" src="lib/window.js"></script>
<link href="lib/dock.css" rel="stylesheet" type="text/css" />
<link href="lib/window.css" rel="stylesheet" type="text/css" />
<link rel="icon" type="image/gif" href="favicon.gif"/>
..........
The links here are all relative paths. How can I get source with absolute paths, as I do with requests, when using requests-html with JS rendering?
The module's documentation describes the difference between absolute and relative links.
Quote:
Grab a list of all links on the page, in absolute form (anchors excluded):
r.html.absolute_links
Could you try this statement?
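Note that absolute_links gives back a collection of URLs, not rewritten HTML. To illustrate what "absolute form" means, here is a minimal sketch that resolves relative references against a base URL using only the standard library. It is not requests-html's own implementation, and the example.com URLs are hypothetical placeholders for the page that was fetched.

```python
from urllib.parse import urljoin

# Hypothetical base URL, standing in for the URL the page was fetched from.
base_url = "https://example.com/docs/index.html"

# Relative references like those in the rendered output above,
# plus one link that is already absolute.
relative_links = [
    "lib/jquery.js",
    "favicon.gif",
    "/about",
    "https://cdn.example.com/a.css",
]

# urljoin resolves each reference against the base URL;
# links that are already absolute are left unchanged.
absolute_links = [urljoin(base_url, link) for link in relative_links]

for link in absolute_links:
    print(link)
```

A path without a leading slash is resolved relative to the directory of the base URL, while a path with a leading slash is resolved relative to the site root.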
This should probably be a feature request for the requests-html developers. For now, though, we can achieve it with this hackish solution:
from requests_html import HTMLSession
from lxml import etree

with HTMLSession() as session:
    html = session.get('https://whosebug.com').html
    html.render()

    # iterate over all links
    for link in html.pq('a'):
        if "href" in link.attrib:
            # Make links absolute
            link.attrib["href"] = html._make_absolute(link.attrib["href"])

    # Print html with only absolute links
    print(etree.tostring(html.lxml).decode())
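We modify the underlying lxml tree of the html object by iterating over all links and rewriting their locations to absolute ones with the html object's private _make_absolute function.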
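The loop above only touches href on a elements, whereas the relative paths in the question also sit in the src and href attributes of script and link tags. As a rough illustration of the same idea applied to every tag, here is a minimal standard-library sketch. It does not use requests-html at all: AbsoluteLinkRewriter, LINK_ATTRS, and the example.com base URL are hypothetical names made up for this example, and the re-serialised markup is not guaranteed to match the input byte for byte.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin

# Attributes treated as links in this sketch (an assumption; extend as needed).
LINK_ATTRS = {"href", "src"}


class AbsoluteLinkRewriter(HTMLParser):
    """Re-emit HTML while resolving href/src values against a base URL."""

    def __init__(self, base_url):
        # Keep character references in text as they were written.
        super().__init__(convert_charrefs=False)
        self.base_url = base_url
        self.out = []

    def _render_tag(self, tag, attrs, closing=""):
        parts = [tag]
        for name, value in attrs:
            if value is None:  # bare attribute such as "async"
                parts.append(name)
                continue
            if name in LINK_ATTRS:
                value = urljoin(self.base_url, value)
            parts.append(f'{name}="{escape(value, quote=True)}"')
        return "<" + " ".join(parts) + closing + ">"

    def handle_starttag(self, tag, attrs):
        self.out.append(self._render_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._render_tag(tag, attrs, " /"))

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


# Hypothetical input and base URL for demonstration.
sample = '<link href="lib/dock.css" rel="stylesheet" /><a href="/about">About</a>'
rewriter = AbsoluteLinkRewriter("https://example.com/app/")
rewriter.feed(sample)
rewriter.close()
result = "".join(rewriter.out)
print(result)
```

With requests-html, the string fed to such a rewriter would presumably be the rendered markup and the base URL the URL of the page, but that wiring is left out here because it needs network access.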