Not able to achieve something with the Jsoup HTML parser in Java

Question

I am unable to parse some text for the following scenarios using the Jsoup Java library.

1 : This is <b>My Text</b> some other <b> </b> text as well <b></b><b>non empty tag1</b> other text.

Expected output: some other text as well

2 : This is <b>My Text</b> some other <b> </b> text as well <b></b><b>non empty tag2</b> other text.

Expected output: some other text as well

3 : This is <b>My Text</b> some other <b> </b> text as well <b></b><b>non empty tag2</b> other text <b></b> <b>non empty tag3</b>.

Expected output: some other text as well

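Note that the text My Text is fixed (static), but the value of the second non-empty <b> tag (do not treat a space as a value) may vary. The regular expression should extract the text between My Text and the first non-empty <b> tag that occurs after it.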

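I am using the Jsoup library but cannot produce the expected output above. Please make sure the solution is generic enough to cover every case, because the content is dynamic in my situation.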

Answer 1

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class Main {

      public static void main(String[] args) {
        String html = "This is <b>My Text</b> some other <b> </b> text as well <b></b><b>non empty tag1</b> other text";
        System.out.println(getTargetText(html));
        html = "This is <b>My Text</b> some other <b> </b> text as well <b></b><b>non empty tag2</b> other text";
        System.out.println(getTargetText(html));
        html = "This is <b>My Text</b> some other <b> </b> text as well <b></b><b>non empty tag2</b> other text <b></b> <b>non empty tag3</b>";
        System.out.println(getTargetText(html));
      }

      // Returns the raw HTML between <b>My Text</b> and the first non-empty <b> after it,
      // or null if either tag cannot be located.
      public static String getTargetText(String html) {
        Document doc = Jsoup.parse(html);
        Elements bTags = doc.getElementsByTag("b");
        Element startBTag = null;
        Element endBTag = null;

        for (Element bTag : bTags) {
          String text = bTag.text().trim(); // use html() instead of text() if you want to match nested inner tags
          if (startBTag == null) {
            if (text.equals("My Text")) {
              startBTag = bTag;
            }
          } else if (!text.isEmpty()) { // first non-empty <b> after the start tag, whatever its text is
            endBTag = bTag;
            break;
          }
        }

        if (endBTag != null) {
          // Locate both tags in the original string. This assumes outerHtml() reproduces
          // the input markup exactly, which holds for simple tags like the samples.
          String startString = startBTag.outerHtml();
          String endString = endBTag.outerHtml();
          int startIndex = html.indexOf(startString);
          if (startIndex >= 0) {
            int endIndex = html.indexOf(endString, startIndex + startString.length());
            if (endIndex >= 0) {
              return html.substring(startIndex + startString.length(), endIndex);
            }
          }
        }
        return null;
      }
    }

Output:

     some other <b> </b> text as well <b></b>
     some other <b> </b> text as well <b></b>
     some other <b> </b> text as well <b></b>
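
Since the question asks for a regular expression, here is a minimal sketch of the same extraction using only `java.util.regex`, without Jsoup. The class name `RegexFragment` and the helpers `between` and `stripTags` are hypothetical names chosen for illustration. It assumes the `<b>` tags carry no attributes and contain no nested tags; regex matching on HTML is fragile, so treat this as a sketch for simple markup like the samples rather than a general parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexFragment {

    // Hypothetical helper: returns the raw HTML between <b>fixedText</b> and the
    // next <b> whose content is not empty or whitespace-only, or null if no match.
    static String between(String html, String fixedText) {
        Pattern p = Pattern.compile(
                Pattern.quote("<b>" + fixedText + "</b>")
                        + "(.*?)"                     // lazily capture the fragment
                        + "<b>\\s*[^<\\s][^<]*</b>",  // first <b> containing visible text
                Pattern.DOTALL);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Hypothetical helper: replaces tags with spaces, collapses whitespace, trims.
    static String stripTags(String fragment) {
        return fragment.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "This is <b>My Text</b> some other <b> </b> text as well "
                + "<b></b><b>non empty tag1</b> other text";
        String fragment = between(html, "My Text");
        System.out.println(fragment);
        System.out.println(stripTags(fragment));
    }
}
```

`between` keeps the empty `<b>` tags in the result, like the output above; `stripTags` reduces the fragment to plain text if only the words are needed.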

Answer 2

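A simple solution could look like this:

- Find the <b> element you are interested in (the one containing the text you are looking for).
- Iterate over the siblings that follow it and collect them until you reach a non-empty <b>.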

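You only need to remember that Jsoup uses Node to store everything in the document (including text that is not part of any tag), while the Element class (which extends Node) represents only actual tags.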

For example, text like

before <b>bold</b> after<i>italic</i>

will be represented as

<node>before </node>
<element tag="B">
   <node>bold</node>
</element>
<node> after</node>
<element tag="I">
   <node>italic</node>
</element>

So if you call select("b") (which finds <element tag="B">) and then nextElementSibling(), you will be moved to <element tag="I">. To get <node> after</node> you need to use nextSibling(), which does not skip plain text nodes.
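
The difference between the two calls can be checked on the sample text above. This is a small sketch, assuming Jsoup is on the classpath; the class name `SiblingDemo` and its two helper methods are hypothetical names used only for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

class SiblingDemo {

    // Tag name of the element that follows the first <b>; text nodes are skipped.
    static String nextElementTag(String html) {
        Element b = Jsoup.parse(html).select("b").first();
        Element next = b.nextElementSibling();
        return next == null ? null : next.tagName();
    }

    // Raw text of the node that directly follows the first <b>,
    // or null if that node is not a text node.
    static String nextNodeText(String html) {
        Element b = Jsoup.parse(html).select("b").first();
        Node next = b.nextSibling();
        return (next instanceof TextNode) ? ((TextNode) next).getWholeText() : null;
    }

    public static void main(String[] args) {
        String html = "before <b>bold</b> after<i>italic</i>";
        System.out.println(nextElementTag(html)); // the <i> element's tag name
        System.out.println(nextNodeText(html));   // the text node right after <b>
    }
}
```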

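One possible problem with the Node class is that it does not provide a text() method that returns the text content of the current node (which would let us test whether the current node or element contains any text). But nothing stops us from casting a Node that represents a tag to Element, which does provide that method.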

So our solution could look like this:

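import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

public static String findFragment(String html, String fixedStart) {

    Document doc = Jsoup.parse(html);
    // Find the <b> whose text is exactly fixedStart (quoted so it is matched literally)
    Element myBTag = doc
            .select("b:matches(^" + Pattern.quote(fixedStart) + "$)")
            .first();
    if (myBTag == null) {
        return null; // no <b> with the fixed text
    }

    StringBuilder sb = new StringBuilder();

    // Walk over all following siblings, text nodes included
    Node currentSibling = myBTag.nextSibling();
    while (currentSibling != null) {
        if (currentSibling.nodeName().equals("b")) {
            Element b = (Element) currentSibling;
            if (!b.text().trim().isEmpty()) {
                break; // first non-empty <b>: stop without appending it
            }
        }
        sb.append(currentSibling.toString());
        currentSibling = currentSibling.nextSibling();
    }

    return sb.toString();
}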
