Regex in lxml for Python

Question

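I'm having trouble running a regular expression inside an XPath expression. My goal is to download the HTML of the main page, plus the content of every page it links to. However, the program throws an exception because some href values don't point to anything (for example "//:javascript" or "#"). How do I use a regex in XPath? Is there an easier way to exclude non-absolute hrefs?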

from lxml import html
import requests
main_pg = requests.get("http://gazetaolekma.ru/")
with open("Sample.html","w", encoding='utf-8') as doc:
    doc.write(main_pg.text)
tree = html.fromstring(main_pg.content)
hrefs = tree.xpath('//a[re:findall("^(http|https|ftp):.*")]/@href')
for href in hrefs:
    link_page = requests.get(href)
    with open("%s.html"%href[0:9], "w", encoding ='utf-8') as href_doc:
        href_doc.write(link_page.text)

Answer 1

According to the documentation, lxml supports the EXSLT extensions, which in turn provide regular expressions:

lxml supports XPath 1.0, XSLT 1.0 and the EXSLT extensions through libxml2 and libxslt in a standards compliant way.

For example, using the EXSLT re:test() function:

....
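# Bind the "re" prefix to the EXSLT regular-expressions namespace
ns = {'re': 'http://exslt.org/regular-expressions'}
# The pattern is anchored at the start of @href; the "i" flag ignores case
hrefs = tree.xpath('//a[re:test(@href, "^(http|https|ftp):", "i")]/@href', namespaces=ns)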
.....
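
To try the re:test() approach without hitting the network, here is a minimal sketch that runs against an inline HTML snippet. The SAMPLE markup, the EXSLT_RE_NS name and the absolute_hrefs() helper are assumptions made up for illustration; the namespace URI and the re:test() function come from the answer. The "re" prefix has to be bound through the namespaces argument, otherwise the XPath expression cannot resolve it.

```python
from lxml import html

# Hypothetical sample markup standing in for the downloaded main page.
SAMPLE = """
<html><body>
  <a href="http://example.com/a">absolute http</a>
  <a href="HTTPS://example.com/b">upper-case scheme</a>
  <a href="ftp://example.com/c">ftp</a>
  <a href="#">fragment only</a>
  <a href="javascript:void(0)">script</a>
  <a href="/relative/path">relative</a>
  <a>no href at all</a>
</body></html>
"""

# Prefix-to-URI mapping for the EXSLT regular-expressions extension.
EXSLT_RE_NS = {"re": "http://exslt.org/regular-expressions"}


def absolute_hrefs(tree):
    """Return href values that start with http:, https: or ftp: (any case)."""
    return tree.xpath(
        '//a[re:test(@href, "^(https?|ftp):", "i")]/@href',
        namespaces=EXSLT_RE_NS,
    )


tree = html.fromstring(SAMPLE)
links = absolute_hrefs(tree)
print(links)
```

Only the three links with an http, https or ftp scheme should come back; "#", the javascript: link, the relative path and the anchor without an href are skipped.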

Answer 2

With XPath 1.0, you can always use "or" in the predicate:

hrefs = tree.xpath('//a/@href[starts-with(., "http") or starts-with(., "ftp")]')
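
Note that starts-with() is case-sensitive and only compares a prefix. As a different technique from the one in this answer, the filtering can also be done in Python after collecting every href, using urllib.parse.urlsplit from the standard library. The sketch below is only one possible way to do it; SAMPLE, ALLOWED_SCHEMES and is_absolute() are hypothetical names introduced for the example.

```python
from urllib.parse import urlsplit

from lxml import html

# Hypothetical sample markup standing in for the downloaded main page.
SAMPLE = """
<html><body>
  <a href="http://example.com/a">absolute http</a>
  <a href="HTTPS://example.com/b">upper-case scheme</a>
  <a href="ftp://example.com/c">ftp</a>
  <a href="#">fragment only</a>
  <a href="javascript:void(0)">script</a>
  <a href="/relative/path">relative</a>
</body></html>
"""

# Assumption: the same schemes the question's pattern was looking for.
ALLOWED_SCHEMES = {"http", "https", "ftp"}


def is_absolute(href):
    """True if href has an allowed scheme and a host part."""
    parts = urlsplit(str(href))
    # urlsplit lower-cases the scheme, so "HTTPS://..." is accepted too.
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.netloc)


tree = html.fromstring(SAMPLE)

# The XPath 1.0 predicate from the answer: case-sensitive prefix match.
xpath_hits = tree.xpath('//a/@href[starts-with(., "http") or starts-with(., "ftp")]')

# Python-side filter: collect every href first, then keep the absolute ones.
python_hits = [h for h in tree.xpath("//a/@href") if is_absolute(h)]

print(xpath_hits)
print(python_hits)
```

On this sample the XPath predicate misses the upper-case "HTTPS://" link, while the Python-side filter keeps it. Which behavior is preferable depends on the pages being crawled.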
