Scrape paragraph with href inside

Question

This is the HTML:

<p class="myParagraph">
  Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus vel justo
  <a href="http://google.it" class="small-link" target="_blank">
    <span class="tco-ellipsis"></span>
    <span class="invisible">https://</span>
    <span class="js-display-url">google.it</span>
    <span class="invisible">lpage/events/?ref=page_internal&amp;mt_nav=0&amp;locale2=it_IT</span>
    <span class="tco-ellipsis">
      <span class="invisible">&nbsp;</span>…
    </span>
  </a> ornare, suscipit nisl eget, aliquam augue. Aenean quis pretium
</p>

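If I use tree.xpath('//p/text()') it only returns

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus vel justo

instead of

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus vel justo ornare, suscipit nisl eget, aliquam augue. Aenean quis pretium

I also tried tree.xpath('string(//p)'). How can I get the full paragraph together with the href? Not every paragraph contains an a element.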

Answer 1

xpath('//p/text()') returns a list of strings, one for each text node directly inside the paragraph. Join these strings to get the desired result.

from lxml import html

doc = """<p class="myParagraph">
  Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus vel justo
  <a href="http://google.it" class="small-link" target="_blank">
    <span class="tco-ellipsis"></span>
    <span class="invisible">https://</span>
    <span class="js-display-url">google.it</span>
    <span class="invisible">lpage/events/?ref=page_internal&amp;mt_nav=0&amp;locale2=it_IT</span>
    <span class="tco-ellipsis">
      <span class="invisible">&nbsp;</span>…
    </span>
  </a> ornare, suscipit nisl eget, aliquam augue. Aenean quis pretium
</p>"""

root = html.fromstring(doc)
# text() selects only the text nodes directly under <p>, so the text nested inside <a> is skipped
print("".join(root.xpath("//p/text()")))
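
The question also asks for the href, and notes that not every paragraph has a link. A minimal sketch of one way to handle both, assuming each paragraph should be processed on its own: the helper name scrape_paragraphs and the shortened sample markup are hypothetical, not part of the original page. It joins the direct text nodes as above, collapses the leftover whitespace, and collects any href values with a relative XPath.

```python
from lxml import html

# Hypothetical sample: one paragraph with a link, one without.
doc = """<div>
<p>Vivamus vel justo <a href="http://google.it"><span>google.it</span></a> ornare, suscipit nisl eget</p>
<p>Aenean quis pretium</p>
</div>"""


def scrape_paragraphs(source):
    """Return a (text, hrefs) pair for each <p>, leaving out text nested inside <a>."""
    root = html.fromstring(source)
    results = []
    for p in root.xpath("//p"):
        # Direct text nodes only, so the link's inner spans are left out.
        joined = "".join(p.xpath("./text()"))
        # Collapse runs of whitespace left behind by the markup's indentation.
        text = " ".join(joined.split())
        # Evaluates to an empty list when the paragraph has no <a> element.
        hrefs = p.xpath(".//a/@href")
        results.append((text, hrefs))
    return results


for text, hrefs in scrape_paragraphs(doc):
    print(text, hrefs)
```

Because the XPath expressions are evaluated relative to each p element, a paragraph without a link simply yields an empty list of hrefs rather than raising an error.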
