Web scraping with lxml
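How can I use XPath to extract the text between tags? For example, I am trying to extract the text that starts with "Area:", but the following code extracts only the word "Area" and not the text that comes after it.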
import lxml.html

tree = lxml.html.fromstring(response.text)
# descendant::text() only looks inside <b>, so this yields just "Area:"
xpath_ex = '//b[contains(text(),"Area:")]/descendant::text()'
raw_ex = tree.xpath(xpath_ex)
The HTML you posted in your comment is incomplete, but assuming it looks something like this:
resp = """
<div class="text"><h4>ABC, Assistant Professor </h4>
<p><b>Area:</b> Natural Language Processing, Artificial Intelligence,
Computer Graphics, Computer Vision<a href=" somelink/people/Faculty/Profile/ABC.html"></a> </p> <p> <a href="/computing/people/faculty/ABC.html">Profile & Contact Information </a> | Home Page</p>
</div>
"""
Try this:
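from lxml.html import fromstring

tree = fromstring(resp)
# Select the <p> elements inside the div; the first one holds the "Area:" line
paragraphs = tree.xpath('//div[@class="text"]/p')
# text_content() returns the text of the element and all of its descendants
print(paragraphs[0].text_content())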
Output:
Area: Natural Language Processing, Artificial Intelligence, Computer Graphics, Computer Vision
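If you want only the text that follows the label, without "Area:" itself, one option is the XPath following-sibling axis. The text you are after is not inside <b>; it is a sibling text node of <b> within the same <p>, which is why descendant::text() on <b> returns only the label. The sketch below is one possible approach, not the only one. It assumes the markup matches the sample above (the first <p> is repeated here so the block runs on its own), and the names parts and area are just illustrative.

```python
from lxml.html import fromstring

# Assumed markup, copied from the sample above (first <p> only).
resp = """
<div class="text"><h4>ABC, Assistant Professor </h4>
<p><b>Area:</b> Natural Language Processing, Artificial Intelligence,
Computer Graphics, Computer Vision<a href=" somelink/people/Faculty/Profile/ABC.html"></a> </p>
</div>
"""

tree = fromstring(resp)

# Text nodes that are siblings of <b> and come after it,
# i.e. the text following the label rather than the label itself.
parts = tree.xpath('//b[contains(text(), "Area:")]/following-sibling::text()')

# Join the pieces, then collapse newlines and repeated spaces.
area = " ".join(" ".join(parts).split())
print(area)
```

The split/join step is there because the source text contains a line break and trailing whitespace; the same normalization can be applied to the result of text_content() if you prefer that route.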