Regex in lxml for Python
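I'm having trouble using a regular expression inside an XPath expression. My goal is to download the HTML of the main page and of every page it links to. However, the program raises an exception because some href values don't point to anything fetchable (for example "//:javascript" or "#"). How do I use a regular expression in XPath? Is there a simpler way to exclude non-absolute hrefs?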
from lxml import html
import requests

main_pg = requests.get("http://gazetaolekma.ru/")
with open("Sample.html", "w", encoding='utf-8') as doc:
    doc.write(main_pg.text)

tree = html.fromstring(main_pg.content)
hrefs = tree.xpath('//a[re:findall("^(http|https|ftp):.*")]/@href')
for href in hrefs:
    link_page = requests.get(href)
    with open("%s.html" % href[0:9], "w", encoding='utf-8') as href_doc:
        href_doc.write(link_page.text)
According to the documentation, lxml supports the EXSLT extensions, which in turn provide regular expressions:
lxml supports XPath 1.0, XSLT 1.0 and the EXSLT extensions through libxml2 and libxslt in a standards compliant way.
For example, using the EXSLT re:test() function:
....
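ns = {'re': 'http://exslt.org/regular-expressions'}
# Raw string so that \b stays a regex word boundary (not a backspace character);
# the "re" prefix must be passed to xpath() through namespaces=.
hrefs = tree.xpath(r'//a[re:test(@href, "^(http|https|ftp):.*\b", "i")]/@href', namespaces=ns)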
.....
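To try the re:test() filter above without touching the network, here is a minimal sketch that runs it against a small inline snippet. The sample markup and the `EXSLT_RE` name are made up for illustration; only the namespace URI and the xpath() call come from the answer.

```python
from lxml import html

# Hypothetical sample markup standing in for the downloaded main page.
SAMPLE = """
<html><body>
  <a href="http://example.com/a">absolute http</a>
  <a href="HTTPS://example.com/b">absolute https, upper case</a>
  <a href="ftp://example.com/c">ftp</a>
  <a href="#">fragment only</a>
  <a href="javascript:void(0)">script pseudo-link</a>
  <a href="/relative/path">relative</a>
</body></html>
"""

# Prefix-to-URI mapping for the EXSLT regular-expression functions.
EXSLT_RE = {'re': 'http://exslt.org/regular-expressions'}

tree = html.fromstring(SAMPLE)

# Keep only anchors whose href starts with an allowed scheme;
# the "i" flag makes the match case-insensitive.
hrefs = tree.xpath(
    r'//a[re:test(@href, "^(http|https|ftp):", "i")]/@href',
    namespaces=EXSLT_RE,
)
print(hrefs)
```

The "#", "javascript:" and relative links are dropped, so the later requests.get() loop only sees URLs it can actually fetch.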
With plain XPath 1.0, you can always use or in the predicate instead:
hrefs = tree.xpath('//a/@href[starts-with(., "http") or starts-with(., "ftp")]')
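If the filtering does not have to happen inside XPath at all, another option is to select every @href and filter the strings in Python. This is not from the answers above, just a sketch using the standard library's urllib.parse; the helper name `is_absolute_link`, the `ALLOWED_SCHEMES` set, and the candidate list are assumptions for illustration.

```python
from urllib.parse import urlsplit

# Schemes we are willing to download from (an assumption mirroring the regex above).
ALLOWED_SCHEMES = {"http", "https", "ftp"}


def is_absolute_link(href):
    """Return True when href has an allowed scheme and a network location."""
    parts = urlsplit(href.strip())
    return parts.scheme.lower() in ALLOWED_SCHEMES and bool(parts.netloc)


# Stand-in for the result of tree.xpath('//a/@href').
candidates = [
    "http://gazetaolekma.ru/",
    "#",
    "javascript:void(0)",
    "/news/1",
    "FTP://example.com/file",
    "mailto:someone@example.com",
]

absolute = [h for h in candidates if is_absolute_link(h)]
print(absolute)
```

Checking for a network location as well as a scheme also rejects values such as "mailto:" or "javascript:" links, which have a scheme but nothing requests.get() could connect to.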