getText() with jsoup or Tika: putting each li element on its own line

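When extracting the full text of an HTML page (with Tika or jsoup), is it possible to add a carriage return between each 'li' element?

At the moment all of the text comes out compacted, with no line breaks between the list items.

Thanks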

Here is an improved version of Andrew Phillips's answer.

Java

package com.github.davidepastore.stackoverflow33947074;

import java.io.IOException;
import java.io.InputStream;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

/**
 * Stackoverflow 33947074
 *
 */
public class App 
{
    public static void main( String[] args ) throws IOException {
        ClassLoader classloader = Thread.currentThread()
                .getContextClassLoader();
        InputStream is = classloader.getResourceAsStream("file.html");
        Document document = Jsoup.parse(is, "UTF-8", "");
        Element element = document.select("html").first();
        String text = getText(element);
        System.out.println("Result: " + text);
    }

    /**
     * Get the custom text from the given {@link Element}.
     * @param element The {@link Element} from which get the custom text.
     * @return Returns the custom text.
     */
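    private static String getText(Element element) {
        // A StringBuilder avoids repeated String concatenation inside the loop.
        StringBuilder working = new StringBuilder();
        for (Node child : element.childNodes()) {
            if (child instanceof TextNode) {
                working.append(((TextNode) child).text());
            } else if (child instanceof Element) {
                Element childElement = (Element) child;
                // Start a new line before the text of each <li> element.
                if (childElement.tag().getName().equalsIgnoreCase("li")) {
                    working.append("\n");
                }
                // Recurse so that text in nested elements (a, b, ...) is kept.
                working.append(getText(childElement));
            }
        }
        return working.toString();
    }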
}

file.html

<html>
<head>
<title>Try jsoup</title>
</head>
<body>
<p>This is <a href="http://jsoup.org/">jsoup</a>.</p>
<ul>
    <li>First element</li>
    <li><a href="#">Second element</a></li>
    <li>Third element <b>Additional for third element</b></li>
</ul>
</body>
</html>

Output

Result:  Try jsoup   This is jsoup.  
First element 
Second element 
Third element Additional for third element
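
The output above still carries the leading and trailing spaces left over from the whitespace between tags. If that matters, the result can be tidied afterwards. The following is only a minimal sketch using the plain JDK: the class TextTidy and the method tidyLines are hypothetical names made up for illustration, not part of jsoup, Tika, or the answer's code.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

/** Hypothetical helper for illustration; not part of jsoup or Tika. */
final class TextTidy {

    /**
     * Collapses runs of spaces and tabs inside each line, trims each line,
     * and drops lines that end up empty.
     */
    static String tidyLines(String text) {
        return Arrays.stream(text.split("\n"))
                .map(line -> line.replaceAll("[ \\t]+", " ").trim())
                .filter(line -> !line.isEmpty())
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        // Sample input shaped like the text printed after "Result: " above.
        String raw = " Try jsoup   This is jsoup.  \nFirst element \nSecond element \n"
                + "Third element Additional for third element";
        System.out.println(tidyLines(raw));
    }
}
```

In the App class this would be called as tidyLines(getText(element)) before printing. Filtering out empty lines is a design choice of this sketch; drop the filter if blank lines between blocks should be preserved.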