Search text and get its position in a PDF with Java

Question

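How can I search for text and get its position in a PDF with Java? I tried Apache PDFBox and PDF Clown, but neither works whenever the text wraps to the next line or a new paragraph starts. I want to get the same result as in the image below.

Thanks.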

Desired result

Answer 1

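You refer to an earlier answer as the PDFBox example that does not work for you. Actually, as already explained in that answer, it is surprising that the code matches anything beyond single words at all, because the callers of the routine overridden there give the impression of calling it word by word. Thus, it is next to impossible for it to find anything that spans more than a single line.

But assuming that lines are split at spaces, that example can be improved in a quite natural way to allow searches across line boundaries. Replace the method findSubwords with this improved version: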

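// Imports used (PDFBox 2.x): java.io.IOException, java.util.ArrayList, java.util.List,
// org.apache.pdfbox.pdmodel.PDDocument, org.apache.pdfbox.text.PDFTextStripper,
// org.apache.pdfbox.text.TextPosition, org.apache.pdfbox.util.Matrix
List<TextPositionSequence> findSubwordsImproved(PDDocument document, int page, String searchTerm) throws IOException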
{
    final List<TextPosition> allTextPositions = new ArrayList<>();
    PDFTextStripper stripper = new PDFTextStripper()
    {
        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException
        {
            allTextPositions.addAll(textPositions);
            super.writeString(text, textPositions);
        }

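        // PDFBox calls this at each line break; append a virtual zero-width
        // space unless the text collected so far already ends in a space.
        @Override
        protected void writeLineSeparator() throws IOException {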
            if (!allTextPositions.isEmpty()) {
                TextPosition last = allTextPositions.get(allTextPositions.size() - 1);
                if (!" ".equals(last.getUnicode())) {
                    Matrix textMatrix = last.getTextMatrix().clone();
                    textMatrix.setValue(2, 0, last.getEndX());
                    textMatrix.setValue(2, 1, last.getEndY());
                    TextPosition separatorSpace = new TextPosition(last.getRotation(), last.getPageWidth(), last.getPageHeight(),
                            textMatrix, last.getEndX(), last.getEndY(), last.getHeight(), 0, last.getWidthOfSpace(), " ",
                            new int[] {' '}, last.getFont(), last.getFontSize(), (int) last.getFontSizeInPt());
                    allTextPositions.add(separatorSpace);
                }
            }
            super.writeLineSeparator();
        }
    };
    
    stripper.setSortByPosition(true);
    stripper.setStartPage(page);
    stripper.setEndPage(page);
    stripper.getText(document);

    final List<TextPositionSequence> hits = new ArrayList<TextPositionSequence>();
    TextPositionSequence word = new TextPositionSequence(allTextPositions);
    String string = word.toString();

    int fromIndex = 0;
    int index;
    while ((index = string.indexOf(searchTerm, fromIndex)) > -1)
    {
        hits.add(word.subSequence(index, index + searchTerm.length()));
        fromIndex = index + 1;
    }

    return hits;
}

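(SearchSubword method)

Here we collect all TextPosition entries, and we even add virtual entries of that kind representing a space whenever PDFBox writes a line separator. Once the whole page has been processed, we search the collection of all these text positions.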
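The idea can be illustrated without PDFBox. The following is only a minimal sketch, not the code of this answer: `VirtualSpaceSearch`, `Glyph`, `flatten`, `find` and `line` are hypothetical names, `Glyph` is a simplified stand-in for `TextPosition`, and the coordinates are made up for illustration. It shows why the virtual space matters: each glyph contributes exactly one character, so an index returned by `indexOf` is also an index into the glyph list, even when the hit crosses a line break.

```java
import java.util.ArrayList;
import java.util.List;

final class VirtualSpaceSearch {
    /** Hypothetical stand-in for a TextPosition: one character and its origin. */
    static final class Glyph {
        final char ch;
        final float x;
        final float y;

        Glyph(char ch, float x, float y) {
            this.ch = ch;
            this.x = x;
            this.y = y;
        }
    }

    /** Flattens the lines into one list, adding a virtual space after each line not ending in one. */
    static List<Glyph> flatten(List<List<Glyph>> lines) {
        List<Glyph> all = new ArrayList<>();
        for (List<Glyph> line : lines) {
            all.addAll(line);
            if (!all.isEmpty()) {
                Glyph last = all.get(all.size() - 1);
                if (last.ch != ' ') {
                    // Simplification: the virtual space reuses the last glyph's origin.
                    all.add(new Glyph(' ', last.x, last.y));
                }
            }
        }
        return all;
    }

    /** Returns the glyphs of every occurrence of term, overlapping ones included. */
    static List<List<Glyph>> find(List<Glyph> all, String term) {
        List<List<Glyph>> hits = new ArrayList<>();
        if (term.isEmpty()) {
            return hits;
        }
        StringBuilder text = new StringBuilder();
        for (Glyph g : all) {
            text.append(g.ch);
        }
        int from = 0;
        int index;
        while ((index = text.indexOf(term, from)) > -1) {
            // String index and glyph index coincide: one glyph per character.
            hits.add(new ArrayList<>(all.subList(index, index + term.length())));
            from = index + 1;
        }
        return hits;
    }

    /** Lays out s as one line, 10 units per character (illustrative metrics). */
    static List<Glyph> line(String s, float startX, float y) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            glyphs.add(new Glyph(s.charAt(i), startX + 10 * i, y));
        }
        return glyphs;
    }

    public static void main(String[] args) {
        List<List<Glyph>> lines = new ArrayList<>();
        lines.add(line("value ${var", 100, 100));
        lines.add(line("2} end", 100, 115));
        for (List<Glyph> hit : find(flatten(lines), "${var 2}")) {
            Glyph first = hit.get(0);
            Glyph last = hit.get(hit.size() - 1);
            System.out.println("starts at " + first.x + ", " + first.y
                    + "; last letter '" + last.ch + "' at " + last.x + ", " + last.y);
        }
    }
}
```

Without the virtual space the flattened text would read "${var2}" at the line break and the search term would not be found.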

Applied to the example document from the original question, a search for "${var 2}" now returns all 8 occurrences, including those that span a line break:

* Looking for '${var 2}' (improved)
  Page 1 at 164.39648, 257.65997 with width 37.078552 and last letter '}' at 195.62, 257.65997
  Page 1 at 188.75699, 273.74 with width 37.108047 and last letter '}' at 220.01, 273.74
  Page 1 at 167.49583, 289.72998 with width 40.55017 and last letter '}' at 198.74, 289.72998
  Page 1 at 176.67778, 305.81 with width 38.059418 and last letter '}' at 207.89, 305.81
  Page 1 at 164.39648, 357.28998 with width -46.081444 and last letter '}' at 112.46, 372.65
  Page 1 at 174.97762, 388.72998 with width -56.662575 and last letter '}' at 112.46, 404.09
  Page 1 at 153.74, 420.16998 with width -32.004005 and last letter '}' at 112.46, 435.65
  Page 1 at 162.99922, 451.61 with width -43.692017 and last letter '}' at 112.46, 467.21

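The negative widths occur because the x coordinate of the end of the match is less than that of its start: in the last four hits the match wraps onto the next line, so the final '}' sits near the left margin (x = 112.46), one line below the start.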
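For highlighting, a single rectangle from the first to the last glyph is therefore not useful for wrapped hits. One possible approach, given here only as a sketch under assumptions, is to split the glyphs of a hit into runs that share a baseline and compute one box per run. `HitBoxes`, `Glyph`, `Box` and `boxesPerLine` are hypothetical names, `Glyph` is a simplified stand-in for `TextPosition`, and the coordinates and the tolerance of 1 unit are illustrative values, not taken from the example document.

```java
import java.util.ArrayList;
import java.util.List;

final class HitBoxes {
    /** Hypothetical stand-in for a TextPosition: origin, advance width and height. */
    static final class Glyph {
        final float x, y, width, height;

        Glyph(float x, float y, float width, float height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Box covering the glyphs of a hit that lie on one line. */
    static final class Box {
        final float x, y, width, height;

        Box(float x, float y, float width, float height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        @Override
        public String toString() {
            return "x=" + x + " y=" + y + " width=" + width + " height=" + height;
        }
    }

    /** Returns one box per line; a new line starts when y differs from the run's first glyph by more than tolerance. */
    static List<Box> boxesPerLine(List<Glyph> hit, float tolerance) {
        List<Box> boxes = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= hit.size(); i++) {
            boolean runEnds = i == hit.size()
                    || Math.abs(hit.get(i).y - hit.get(start).y) > tolerance;
            if (runEnds) {
                boxes.add(cover(hit.subList(start, i)));
                start = i;
            }
        }
        return boxes;
    }

    /** Smallest horizontal extent covering all glyphs of a non-empty run. */
    private static Box cover(List<Glyph> run) {
        float minX = Float.MAX_VALUE;
        float maxX = -Float.MAX_VALUE;
        float maxHeight = 0;
        for (Glyph g : run) {
            minX = Math.min(minX, g.x);
            maxX = Math.max(maxX, g.x + g.width);
            maxHeight = Math.max(maxHeight, g.height);
        }
        return new Box(minX, run.get(0).y, maxX - minX, maxHeight);
    }

    public static void main(String[] args) {
        List<Glyph> hit = new ArrayList<>();
        // "${var" on the first line: five glyphs, 10 units wide each.
        for (int i = 0; i < 5; i++) {
            hit.add(new Glyph(160 + 10 * i, 100, 10, 12));
        }
        // Zero-width virtual space at the end of the first line.
        hit.add(new Glyph(210, 100, 0, 12));
        // "2}" at the start of the next line.
        hit.add(new Glyph(100, 115, 10, 12));
        hit.add(new Glyph(110, 115, 10, 12));
        for (Box box : boxesPerLine(hit, 1f)) {
            System.out.println(box);
        }
    }
}
```

Every box produced this way has a non-negative width, and the zero-width virtual space does not enlarge the box of the line it ends.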
