Parsing XML with jsoup (while avoiding <p> tags)
This question is very similar in nature to an existing one, but it concerns Java rather than Python.
<body.content>
<block class="lead_paragraph">
<p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
</block>
<block class="full_text">
<p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
</block>
What I want to do is use jsoup to extract the sentence text without any of the XML markup.
So I am looking for:
LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.
Update
Actually, my case is slightly different, because there is some additional XML markup that I want to keep, namely <PERSON>:
<block class="full_text">
<p>SCHEINMAN</PERSON>--<PERSON>Alan</PERSON>. Happy Birthday. Thirteen years, many tears. Loving memories of your smile, humor, and laughter comfort us. You are always in our hearts. Love, <PERSON>Roni</PERSON>, <PERSON>Sandy</PERSON>, <PERSON>Jarret</PERSON>, <PERSON>Greg</PERSON>, <PERSON>Kate</PERSON>, and <PERSON>Auden Gray</PERSON></p>
</block></body.content></body></nitf>
The ideal output would be:
SCHEINMAN</PERSON>--<PERSON>Alan</PERSON>. Happy Birthday. Thirteen years, many tears. Loving memories of your smile, humor, and laughter comfort us. You are always in our hearts. Love, <PERSON>Roni</PERSON>, <PERSON>Sandy</PERSON>, <PERSON>Jarret</PERSON>, <PERSON>Greg</PERSON>, <PERSON>Kate</PERSON>, and <PERSON>Auden Gray</PERSON>
My current attempt:
BufferedReader br = new BufferedReader(new FileReader(filename));
try
{
    // Read the whole file into one string, line by line
    StringBuilder sb = new StringBuilder();
    String line = br.readLine();
    while (line != null)
    {
        sb.append(line);
        sb.append(System.lineSeparator());
        line = br.readLine();
    }
    String everything = sb.toString();
    Document doc = Jsoup.parse(everything);
    String link = doc.select("block.full_text").text();
    System.out.println(link);
}
finally
{
    br.close();
}
You can use CSS selectors in jsoup:
String html = "<body.content>\n"
+ " <block class=\"lead_paragraph\">\n"
+ " <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>\n"
+ " </block>\n"
+ " <block class=\"full_text\">\n"
+ " <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>\n"
+ " </block>";
Document doc = Jsoup.parse(html);
String link = doc.select("block.full_text").text();
System.out.println(link);
Output:
LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.
Update: to keep the <PERSON> tags, call html() instead of text() on the selection:
String html = "<block class=\"full_text\">\n"
+ " <p>SCHEINMAN</PERSON>--<PERSON>Alan</PERSON>. Happy Birthday. Thirteen years, many tears. Loving memories of your smile, humor, and laughter comfort us. You are always in our hearts. Love, <PERSON>Roni</PERSON>, <PERSON>Sandy</PERSON>, <PERSON>Jarret</PERSON>, <PERSON>Greg</PERSON>, <PERSON>Kate</PERSON>, and <PERSON>Auden Gray</PERSON></p></block></body.content></body></nitf>";
Document doc = Jsoup.parse(html);
String link = doc.select("block.full_text").html();
System.out.println(link);
Output:
<p>SCHEINMAN--
<person>
Alan
</person>. Happy Birthday. Thirteen years, many tears. Loving memories of your smile, humor, and laughter comfort us. You are always in our hearts. Love,
<person>
Roni
</person>,
<person>
Sandy
</person>,
<person>
Jarret
</person>,
<person>
Greg
</person>,
<person>
Kate
</person>, and
<person>
Auden Gray
</person></p>
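Note that the output above still contains the <p> wrapper, the tag names come back lowercased, and the markup is pretty-printed across several lines. If the original markup has to come out byte for byte, one possible workaround is to skip the HTML parser for this step and pull the paragraph body out with a plain regular expression. The following is only a minimal sketch using the Java standard library, not jsoup: the class name FullTextExtractor and the method extractFullText are made up for illustration, and it assumes the attribute is written exactly as class="full_text" and that the block holds a single, non-nested <p>.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup: a regex-based alternative that
// leaves the <PERSON> tags (including their case) exactly as they are.
class FullTextExtractor {

    // Captures everything between <p> and the first </p> that follows
    // <block class="full_text">. DOTALL lets '.' span line breaks.
    private static final Pattern FULL_TEXT_P = Pattern.compile(
            "<block\\s+class=\"full_text\">\\s*<p>(.*?)</p>",
            Pattern.DOTALL);

    // Returns the raw inner markup of the paragraph, or empty if nothing matches.
    static Optional<String> extractFullText(String xml) {
        Matcher m = FULL_TEXT_P.matcher(xml);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String xml = "<block class=\"full_text\">\n"
                + "      <p>SCHEINMAN</PERSON>--<PERSON>Alan</PERSON>. Love, <PERSON>Roni</PERSON></p>\n"
                + "</block></body.content></body></nitf>";
        // prints: SCHEINMAN</PERSON>--<PERSON>Alan</PERSON>. Love, <PERSON>Roni</PERSON>
        System.out.println(extractFullText(xml).orElse("(no match)"));
    }
}
```

A regular expression is not a general XML parser, so this only holds up while the input keeps the simple shape shown in the question. jsoup also has an XML parser (Parser.xmlParser()) that may be worth trying before falling back to something like this.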