Regex in lxml for Python

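I'm having trouble using a regular expression inside an xpath call. My goal is to download the HTML of the main page, plus the content of every page it links to. However, the program throws an exception because some href values don't point to anything fetchable (e.g. "//:javascript" or "#"). How do I use a regular expression in xpath? Is there a simpler way to exclude non-absolute hrefs?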

from lxml import html
import requests
main_pg = requests.get("http://gazetaolekma.ru/")
with open("Sample.html","w", encoding='utf-8') as doc:
    doc.write(main_pg.text)
tree = html.fromstring(main_pg.content)
hrefs = tree.xpath('//a[re:findall("^(http|https|ftp):.*")]/@href')
for href in hrefs:
    link_page = requests.get(href)
    with open("%s.html"%href[0:9], "w", encoding ='utf-8') as href_doc:
        href_doc.write(link_page.text)

According to the documentation, lxml supports the EXSLT extensions, and through them regular expressions:

lxml supports XPath 1.0, XSLT 1.0 and the EXSLT extensions through libxml2 and libxslt in a standards compliant way.

For example, using the EXSLT re:test() function:

....
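# Bind the "re" prefix to the EXSLT regular-expressions namespace.
ns = {'re': 'http://exslt.org/regular-expressions'}
# Raw string keeps \b as a regex word boundary; namespaces=ns makes the re: prefix resolvable.
hrefs = tree.xpath(r'//a[re:test(@href, "^(http|https|ftp):.*\b", "i")]/@href', namespaces=ns)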
.....
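
To try the re:test() approach without hitting the network, here is a minimal self-contained sketch. The SAMPLE markup and the example.com URLs are made up for illustration, and the pattern is tightened slightly to require "://" after the scheme, which is an assumption about what counts as an absolute link here.

```python
from lxml import html

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE = """
<html><body>
  <a href="http://example.com/a">absolute http</a>
  <a href="HTTPS://example.com/b">upper-case scheme</a>
  <a href="ftp://example.com/c">ftp</a>
  <a href="#">fragment only</a>
  <a href="javascript:void(0)">script pseudo-link</a>
  <a href="/relative/path">relative</a>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# The "re" prefix must be bound to the EXSLT regular-expressions namespace
# and passed to xpath(); otherwise the prefix is undefined.
ns = {"re": "http://exslt.org/regular-expressions"}

# re:test(string, pattern, flags) is true when the pattern matches;
# the "i" flag makes the match case-insensitive.
absolute_hrefs = tree.xpath(
    '//a[re:test(@href, "^(https?|ftp)://", "i")]/@href',
    namespaces=ns,
)
print(absolute_hrefs)
```

Only the three links with an http, https, or ftp scheme should survive; the "#", javascript:, and relative hrefs are filtered out before any request is made.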

With plain XPath 1.0, you can always use the or operator in a predicate:

hrefs = tree.xpath('//a/@href[starts-with(., "http") or starts-with(., "ftp")]')
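
A small sketch of the starts-with() variant on inline markup, assuming the same made-up sample links as placeholders. Two things worth noticing: starts-with() is case-sensitive, so an upper-case scheme is not matched, and a bare "http" prefix would also accept strings such as "httpfoo", so the prefixes below include "://" (an illustrative choice, not part of the original answer).

```python
from lxml import html

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE = """
<html><body>
  <a href="http://example.com/a">absolute http</a>
  <a href="HTTPS://example.com/b">upper-case scheme</a>
  <a href="ftp://example.com/c">ftp</a>
  <a href="#">fragment only</a>
  <a href="/relative/path">relative</a>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# Build the predicate from a list of accepted prefixes:
# starts-with(., "http://") or starts-with(., "https://") or ...
prefixes = ("http://", "https://", "ftp://")
predicate = " or ".join('starts-with(., "%s")' % p for p in prefixes)

# No namespaces are needed: starts-with() and "or" are core XPath 1.0.
hrefs = tree.xpath("//a/@href[%s]" % predicate)
print(hrefs)
```

Here the "HTTPS://" link is dropped because of the case-sensitive comparison; if mixed-case schemes matter, the re:test() form with the "i" flag is the easier route.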