Scrape paragraph with href inside
This is the HTML:
<p class="myParagraph">
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus vel justo
<a href="http://google.it" class="small-link" target="_blank">
<span class="tco-ellipsis"></span>
<span class="invisible">https://</span>
<span class="js-display-url">google.it</span>
<span class="invisible">lpage/events/?ref=page_internal&mt_nav=0&locale2=it_IT</span>
<span class="tco-ellipsis">
<span class="invisible"> </span>…
</span>
</a> ornare, suscipit nisl eget, aliquam augue. Aenean quis pretium
</p>
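If I use tree.xpath('//p/text()')
it only returns
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus vel justo
instead of
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus vel justo ornare, suscipit nisl eget, aliquam augue. Aenean quis pretium
I also tried tree.xpath('string(//p)'), as suggested here.
How can I get both the full paragraph and the href? Not every paragraph contains an a element.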
xpath('//p/text()') returns a list of strings. Join those strings to get the desired result.
from lxml import html
doc = """<p class="myParagraph">
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus vel justo
<a href="http://google.it" class="small-link" target="_blank">
<span class="tco-ellipsis"></span>
<span class="invisible">https://</span>
<span class="js-display-url">google.it</span>
<span class="invisible">lpage/events/?ref=page_internal&mt_nav=0&locale2=it_IT</span>
<span class="tco-ellipsis">
<span class="invisible"> </span>…
</span>
</a> ornare, suscipit nisl eget, aliquam augue. Aenean quis pretium
</p>"""
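# Parse the fragment into an element tree.
root = html.fromstring(doc)
# //p/text() yields only the text nodes that are direct children of <p>;
# join them into a single string.
print("".join(root.xpath("//p/text()")))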
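The join above covers the paragraph's own text nodes but not the href. One possible way to collect both, sketched here under a few assumptions: the markup is a trimmed version of the question's HTML wrapped in a div, a second link-free paragraph is added for illustration, and the helper name extract_paragraphs is hypothetical. It uses lxml's text_content(), which also includes the text inside the a element, so the visible link text ends up in the result.

```python
from lxml import html

# Trimmed sample based on the question's HTML, plus a hypothetical
# second paragraph that has no <a> element.
doc = """<div>
<p class="myParagraph">
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus vel justo
<a href="http://google.it" class="small-link" target="_blank">
<span class="js-display-url">google.it</span>
</a> ornare, suscipit nisl eget, aliquam augue. Aenean quis pretium
</p>
<p>No link in this one.</p>
</div>"""


def extract_paragraphs(markup):
    """Return a list of (text, hrefs) tuples, one per <p> element."""
    root = html.fromstring(markup)
    results = []
    for p in root.xpath("//p"):
        # text_content() gathers all descendant text, including link text;
        # split/join collapses runs of whitespace and newlines.
        text = " ".join(p.text_content().split())
        # The relative XPath returns an empty list when there is no <a>.
        hrefs = [str(h) for h in p.xpath(".//a/@href")]
        results.append((text, hrefs))
    return results


results = extract_paragraphs(doc)
for text, hrefs in results:
    print(text, hrefs)
```

Paragraphs without a link simply get an empty href list, so no special case is needed for them.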