JSoup not properly extracting elements by class
I have the following element in a web page:
<div id="pnNij" class="post" data-tag1="" data-tag2="">
<a class="image-list-link" href="http://imgur.com/gallery/pnNij" data-page="0">
<img alt="" src="./Imgur_ The most awesome images on the Internet_files/H7fZCNgb.jpg">
<div class="point-info gradient-transparent-black transition">
<div class="relative">
<div class="pa-bottom">
<div class="arrows">
<div title="like" class="pointer arrow-up icon-upvote-outline" data="pnNij" type="image" data-up="4212"></div>
<div title="dislike" class="pointer arrow-down icon-downvote-outline" data="pnNij" type="image" data-downs="502"></div>
<div class="clear"></div>
</div>
<div class="point-info-points" title="points">
<span class="points-pnNij">3,710</span>
<span class="points-text-pnNij">points</span>
</div>
</div>
</div>
</div>
</a>
<div class="hover">
<p>Seems like 2017 has it all...</p>
<div class="post-info">
album · 69,542 views
</div>
</div>
</div>
Note that the href is equal to http://imgur.com/gallery/pnNij.
However, when I use JSoup to extract the elements from the page like this:
docImgur = Jsoup.connect("http://imgur.com/").get();
Elements links = docImgur.getElementsByClass("post");
the element is extracted almost correctly, except that the href attribute is equal to /gallery/pnNij/.
Why doesn't the href attribute contain the full URL?
When you inspect the page source, you will find
<a class="image-list-link" href="/gallery/WRzti" data-page="0">
...
</a>
So the href attribute in the source is relative, not absolute, which explains the result you are getting: /gallery/WRzti
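Solution
Read the attribute with the abs: prefix (or call Element.absUrl("href")) so that jsoup resolves the href against the document's base URI.
Example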
Document docImgur = Jsoup.connect("http://imgur.com/").get();
Elements links = docImgur.select("a[href].image-list-link");
for (Element element : links) {
    System.out.println(element.attr("abs:href"));
}
Output
http://imgur.com/gallery/WRzti
http://imgur.com/gallery/tCnDJ
http://imgur.com/gallery/JIHYh
...
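To make it clearer what the abs: prefix does, here is a minimal sketch using only java.net.URI. It resolves an href against the URL the page was fetched from, which is roughly what jsoup does with the document's base URI. The class name ResolveHrefSketch and the helper resolve are hypothetical names for illustration, not part of jsoup, and jsoup's own resolution may differ in edge cases.

```java
import java.net.URI;

public class ResolveHrefSketch {

    // Hypothetical helper: resolves a possibly relative href against a base URL.
    // An href that is already absolute is returned unchanged.
    static String resolve(String baseUri, String href) {
        return URI.create(baseUri).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://imgur.com/";

        // Relative href, as found in the page source
        System.out.println(resolve(base, "/gallery/WRzti"));

        // Already absolute href, as shown in the question's snippet
        System.out.println(resolve(base, "http://imgur.com/gallery/pnNij"));
    }
}
```

Both calls yield a full http://imgur.com/gallery/... URL, which is why abs:href gives the same kind of result whether the source contains a relative or an absolute link.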