JSoup: get Wikipedia page summary
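I'm using the MediaWiki API to fetch Wikipedia pages. After getting the HTML content, I try to use
p:not(h2 ~ p)
to get the page summary paragraphs, which should be the paragraphs before the table of contents element. I get the part I want, but also a few extra paragraphs. What is going wrong?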
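p:not(h2 ~ p)
matches every paragraph on the page that is not preceded by an h2 within the same parent. That includes nested paragraphs, paragraphs entirely outside the main content, and so on, because none of those paragraphs share a parent with an h2 in the first place. You don't want those; you only want the paragraphs that appear before the h2 elements in their parent.
To do that, anchor the outer p selector to the parent element. The parent you want is .mw-parser-output:
.mw-parser-output > p:not(h2 ~ p)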
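Code:
import java.io.IOException;
import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class WikiSummary {
    public static void main(String[] args) {
        String url = "https://en.wikipedia.org/wiki/Nico_Ditch";
        try {
            // Fetch and parse the page; the URL also serves as the base URI for relative links.
            Document doc = Jsoup.parse(new URL(url).openStream(), "UTF-8", url);
            // Direct <p> children of the content wrapper that have no preceding <h2> sibling.
            Elements els = doc.select(".mw-parser-output > p:not(h2 ~ p)");
            System.out.println(els);
        } catch (IOException e) {
            // Selecting inside the try block avoids a NullPointerException when the fetch fails.
            e.printStackTrace();
        }
    }
}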
Output:
<p class="mw-empty-elt"> </p>
<p><b>Nico Ditch</b> is a six-mile (9.7 km) long linear <a href="/wiki/Earthworks_(archaeology)" title="Earthworks (archaeology)">earthwork</a> between <a href="/wiki/Ashton-under-Lyne" title="Ashton-under-Lyne">Ashton-under-Lyne</a> and <a href="/wiki/Stretford" title="Stretford">Stretford</a> in Greater Manchester, England. It was dug as a defensive fortification, or possibly a boundary marker, between the 5th and 11th centuries. </p>
<p>The ditch is still visible in short sections, such as a 330-yard (300 m) stretch in <a href="/wiki/Denton,_Greater_Manchester" title="Denton, Greater Manchester">Denton</a> Golf Course. In the parts which survive, the ditch is 4–5 yards (3.7–4.6 m) wide and up to 5 feet (1.5 m) deep. Part of the earthwork is protected as a <a href="/wiki/Scheduled_Ancient_Monument" class="mw-redirect" title="Scheduled Ancient Monument">Scheduled Ancient Monument</a>. </p>
Process finished with exit code 0
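To see why the unanchored selector over-matches, here is a minimal sketch that runs both selectors against a tiny hand-written document instead of the live page. The HTML string, the class name SelectorDemo, and the helper texts are made up for illustration; only the .mw-parser-output wrapper and the two selectors come from the answer above. It assumes jsoup is on the classpath.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorDemo {
    // Hypothetical, heavily simplified stand-in for a Wikipedia page.
    static final String HTML =
        "<div class=\"mw-parser-output\">"
      + "<p>Lead one</p>"
      + "<p>Lead two</p>"
      + "<h2>History</h2>"
      + "<p>Body</p>"
      + "<div class=\"note\"><p>Nested</p></div>"
      + "</div>"
      + "<div id=\"footer\"><p>Footer</p></div>";

    // Returns the text of every element matched by the CSS query, in document order.
    static List<String> texts(String css) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(css).eachText();
    }

    public static void main(String[] args) {
        // Unanchored: also picks up paragraphs whose parent contains no h2 at all.
        System.out.println(texts("p:not(h2 ~ p)"));
        // Anchored: only direct children of the content wrapper that precede the first h2.
        System.out.println(texts(".mw-parser-output > p:not(h2 ~ p)"));
    }
}
```

On this sample the first query returns Lead one, Lead two, Nested and Footer, while the anchored query returns only Lead one and Lead two. The Nested and Footer paragraphs slip through the first query because they have no h2 sibling of their own.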