Strip information from xpath?
I use the following line of code to get CVE IDs from a web page:
project.cve_information = "".join(xpath_parse(tree, '//div[@id="references"]/a/text()')).split()
However, the problem is that the page contains this markup:
<div id='references'>
<b>References:</b>
<a href='https://access.redhat.com/security/cve/CVE-2011-3256' target='_blank'>CVE-2011-3256 <i class='icon-external-link'></i></a>
<a href='https://rhn.redhat.com/errata/RHSA-2011-1402.html' target='_blank'>RHSA-2011-1402 <i class='icon-external-link'></i></a><br />
</div>
References: CVE-xxxx-xxxx RHSA-xxxx-xxxx
How can I keep RHSA and similar entries from being parsed? I only want the CVE-xxxx-xxxx values. I use them to submit a form like this:
"form[CVEID]" : ",".join(self.cve_information) if self.cve_information else "GENERIC-MAP-NOMATCH",
The form only validates CVE values, so it errors out because my code tends to include RHSA values as well.
You can use contains:
h = """ <div id='references'>
<b>References:</b>
<a href='https://access.redhat.com/security/cve/CVE-2011-3256' target='_blank'>CVE-2011-3256 <i class='icon-external-link'></i></a>
<a href='https://rhn.redhat.com/errata/RHSA-2011-1402.html' target='_blank'>RHSA-2011-1402 <i class='icon-external-link'></i></a><br />
</div>"""
from lxml import html
xml = html.fromstring(h)
urls = xml.xpath('//div[@id="references"]/a[contains(@href, "CVE")]/@href')
Or, if you want to ignore the RHSA hrefs, you can use not(contains(...)):
urls = xml.xpath('//div[@id="references"]/a[not(contains(@href, "RHSA"))]/@href')
Both will give you:
['https://access.redhat.com/security/cve/CVE-2011-3256']
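The queries above return the link URLs, whereas the question asks for the CVE-xxxx-xxxx values themselves. A minimal sketch of one way to get those, assuming the ID is always the anchor's leading text as in the sample markup: select text() instead of @href, strip the whitespace, and keep only strings that look like CVE IDs. The names SAMPLE, CVE_RE and extract_cve_ids are hypothetical and not part of the original code.

```python
import re
from lxml import html

# Sample markup copied from the question.
SAMPLE = """<div id='references'>
<b>References:</b>
<a href='https://access.redhat.com/security/cve/CVE-2011-3256' target='_blank'>CVE-2011-3256 <i class='icon-external-link'></i></a>
<a href='https://rhn.redhat.com/errata/RHSA-2011-1402.html' target='_blank'>RHSA-2011-1402 <i class='icon-external-link'></i></a><br />
</div>"""

# CVE IDs have the form CVE-<year>-<sequence of at least four digits>.
CVE_RE = re.compile(r"^CVE-\d{4}-\d{4,}$")


def extract_cve_ids(markup):
    """Return the CVE IDs shown as link text inside div#references."""
    tree = html.fromstring(markup)
    # Same predicate as the answer, but select the link text, not the href.
    texts = tree.xpath('//div[@id="references"]/a[contains(@href, "CVE")]/text()')
    # The text node carries trailing whitespace before the <i> icon.
    stripped = (text.strip() for text in texts)
    # Extra guard: drop anything that is not shaped like a CVE ID.
    return [value for value in stripped if CVE_RE.match(value)]


cve_ids = extract_cve_ids(SAMPLE)
# Mirrors the form field from the question.
form_value = ",".join(cve_ids) if cve_ids else "GENERIC-MAP-NOMATCH"
print(cve_ids, form_value)
```

The regular expression check is optional here, since the XPath predicate already excludes the RHSA link; it only guards against other links whose href happens to contain "CVE".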