Parsing XML with jsoup (while avoiding <p> tags)

This question is essentially very similar to another one, but in Java rather than Python.

<body.content>
  <block class="lead_paragraph">
    <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
  </block>
  <block class="full_text">
    <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
  </block>

What I want to do is use jsoup to extract the sentence text without any of the XML markup.

So I am looking for:

LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.

Update

Actually, my case is slightly different, because there is some additional XML markup that I want to keep, namely <PERSON>:

 <block class="full_text">
    <p>SCHEINMAN</PERSON>--<PERSON>Alan</PERSON>. Happy Birthday. Thirteen years, many tears. Loving memories of your smile, humor, and laughter comfort us. You are always in our hearts. Love, <PERSON>Roni</PERSON>, <PERSON>Sandy</PERSON>, <PERSON>Jarret</PERSON>, <PERSON>Greg</PERSON>, <PERSON>Kate</PERSON>, and <PERSON>Auden Gray</PERSON></p>
 </block></body.content></body></nitf>

The desired output is:

SCHEINMAN</PERSON>--<PERSON>Alan</PERSON>. Happy Birthday. Thirteen years, many tears. Loving memories of your smile, humor, and laughter comfort us. You are always in our hearts. Love, <PERSON>Roni</PERSON>, <PERSON>Sandy</PERSON>, <PERSON>Jarret</PERSON>, <PERSON>Greg</PERSON>, <PERSON>Kate</PERSON>, and <PERSON>Auden Gray</PERSON>

My current attempt:

BufferedReader br = new BufferedReader(new FileReader(filename));
try 
{
  StringBuilder sb = new StringBuilder();
  String line = br.readLine();

  while (line != null) 
  {
    sb.append(line);
    sb.append(System.lineSeparator());
    line = br.readLine();
  }
  String everything = sb.toString();

  Document doc = Jsoup.parse(everything);
  String link = doc.select("block.full_text").text();
  System.out.println(link);      
}
finally 
{
  br.close();
}

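You can use CSS selectors in jsoup. The selector block.full_text matches the <block> element whose class is full_text, and text() returns its text content with all markup stripped.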

String html = "<body.content>\n"
        + "  <block class=\"lead_paragraph\">\n"
        + "    <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>\n"
        + "  </block>\n"
        + "  <block class=\"full_text\">\n"
        + "    <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>\n"
        + "  </block>";
Document doc = Jsoup.parse(html);
String link = doc.select("block.full_text").text();
System.out.println(link);

Output:

LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.
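
Since the input is XML rather than HTML, it may also be worth handing jsoup its XML parser, so the markup is not fitted into an HTML page skeleton. The following is only a minimal sketch, assuming jsoup is on the classpath; the class and method names (BlockText, fullText) are made up for illustration. It narrows the selector to the <p> elements and returns one string per paragraph, which can help when a file holds more than one full_text block.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

import java.util.List;

// Hypothetical helper class; the name is illustrative only.
final class BlockText {

    // Returns the plain text of every <p> inside <block class="full_text">,
    // one list entry per paragraph that contains text.
    static List<String> fullText(String xml) {
        // Parser.xmlParser() treats the input as XML instead of HTML.
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        return doc.select("block.full_text p").eachText();
    }

    public static void main(String[] args) {
        // Shortened stand-in for the sample document in the question.
        String xml = "<body.content>"
                + "<block class=\"lead_paragraph\"><p>LEAD: first paragraph.</p></block>"
                + "<block class=\"full_text\"><p>LEAD: second paragraph.</p></block>"
                + "</body.content>";
        fullText(xml).forEach(System.out::println);
    }
}
```

Calling select(...).text() as above joins all matches into one string; eachText() keeps them separate, which is a design choice rather than a requirement.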

Update: to keep the inner <PERSON> markup, call html() instead of text().

String html = "<block class=\"full_text\">\n"
        + "    <p>SCHEINMAN</PERSON>--<PERSON>Alan</PERSON>. Happy Birthday. Thirteen years, many tears. Loving memories of your smile, humor, and laughter comfort us. You are always in our hearts. Love, <PERSON>Roni</PERSON>, <PERSON>Sandy</PERSON>, <PERSON>Jarret</PERSON>, <PERSON>Greg</PERSON>, <PERSON>Kate</PERSON>, and <PERSON>Auden Gray</PERSON></p></block></body.content></body></nitf>";
Document doc = Jsoup.parse(html);
String link = doc.select("block.full_text").html();
System.out.println(link);

Output:

<p>SCHEINMAN--
 <person>
  Alan
 </person>. Happy Birthday. Thirteen years, many tears. Loving memories of your smile, humor, and laughter comfort us. You are always in our hearts. Love, 
 <person>
  Roni
 </person>, 
 <person>
  Sandy
 </person>, 
 <person>
  Jarret
 </person>, 
 <person>
  Greg
 </person>, 
 <person>
  Kate
 </person>, and 
 <person>
  Auden Gray
 </person></p>
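
The output above differs from the desired one in three ways: the tag names are lowercased, the markup is pretty-printed across several lines, and the surrounding <p> is included. A possible way around this, sketched here under the assumption that jsoup's XML parser and output settings behave as documented, is to parse with Parser.xmlParser() (which keeps the case of tag names), switch pretty-printing off, and take the inner markup of the <p> element. The class and method names (KeepPersonTags, fullTextMarkup) are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

// Hypothetical helper class; the name is illustrative only.
final class KeepPersonTags {

    // Returns the inner markup of the first <p> inside
    // <block class="full_text">, or an empty string if there is none.
    static String fullTextMarkup(String xml) {
        // The XML parser preserves tag case, so <PERSON> stays uppercase.
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        // Without pretty-printing, child tags are not moved onto new lines.
        doc.outputSettings().prettyPrint(false);
        Element p = doc.select("block.full_text p").first();
        // html() on the <p> element yields its children, not the <p> itself.
        return p == null ? "" : p.html();
    }

    public static void main(String[] args) {
        // Shortened, well-formed stand-in for the sample in the question.
        String xml = "<block class=\"full_text\">"
                + "<p>Love, <PERSON>Roni</PERSON> and <PERSON>Sandy</PERSON></p>"
                + "</block>";
        System.out.println(fullTextMarkup(xml));
    }
}
```

Note that the first </PERSON> in the sample from the question has no matching opening tag, so jsoup drops it while parsing; the text SCHEINMAN therefore comes out without a tag around it, as in the output above.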