How to get the text position from the pdf page in iText 7

Question

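I am trying to find the position of text on a PDF page.

What I have tried is to get the text of the PDF page with the PDF text extractor, using the simple text extraction strategy. I loop over each word to check whether my word is present, splitting the text into words with:

var Words = pdftextextractor.Split(new char[] { ' ', '\n' });

What I am not able to do is find the text position. I only need the y coordinate of the word in the PDF file.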

Answer 1

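First, SimpleTextExtractionStrategy is not exactly the 'smartest' strategy (as the name would suggest).

Second, if you want the position, you are going to have to do a lot more work. TextExtractionStrategy assumes you are only interested in the text.

Possible implementation:

Implement IEventListener
Get notified of all events that render text, and store the corresponding TextRenderInfo objects
Once you are finished with the document, sort these objects based on their position on the page
Loop over this list of TextRenderInfo objects; they offer both the text being rendered and the coordinates

How to:

Implement ITextExtractionStrategy (or extend an existing implementation)
Use PdfTextExtractor.getTextFromPage(doc.getPage(pageNr), strategy), where strategy denotes the strategy you created in step 1
Your strategy should be set up to keep track of the locations of the text it processes

ITextExtractionStrategy has the following method in its interface: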

@Override
public void eventOccurred(IEventData data, EventType type) {

    // you can first check the type of the event
    if (!type.equals(EventType.RENDER_TEXT))
        return;

    // now it is safe to cast
    TextRenderInfo renderInfo = (TextRenderInfo) data;
}

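An important thing to keep in mind is that rendering instructions in a PDF do not need to appear in order. The text "Lorem Ipsum Dolor Sit Amet" could be rendered with instructions similar to:

render "Ipsum Do"
render "Lorem "
render "lor Sit Amet"

You will have to do some clever merging (depending on how far apart two TextRenderInfo objects are) and sorting (to get all the TextRenderInfo objects in the proper reading order).

Once that is done, it should be easy.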
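
To make the merging and sorting step a bit more concrete, here is a minimal sketch in plain Java. It deliberately does not use iText: Chunk is a hypothetical stand-in for the values you would copy out of each TextRenderInfo (text, baseline start x, baseline y, baseline end x), and ChunkMerger, toLines and gapTolerance are names invented for this example. It assumes horizontal text and that chunks of one line share the same baseline y after rounding to whole units; real documents may need something more forgiving.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ChunkMerger {

    /** Hypothetical holder for data copied from a text render event. */
    public static final class Chunk {
        final String text;
        final float x;    // x where the baseline starts
        final float y;    // y of the baseline
        final float endX; // x where the baseline ends

        public Chunk(String text, float x, float y, float endX) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.endX = endX;
        }
    }

    /**
     * Sorts chunks top to bottom and left to right, then joins them into lines.
     * A space is inserted when the horizontal gap between two neighbouring
     * chunks on the same line is larger than gapTolerance.
     */
    public static List<String> toLines(List<Chunk> chunks, float gapTolerance) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // Larger y is higher on the page, so sort by descending rounded y, then by x.
        sorted.sort(Comparator.comparingInt((Chunk c) -> -Math.round(c.y))
                .thenComparingDouble(c -> c.x));

        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        Chunk last = null;
        for (Chunk c : sorted) {
            if (last != null && Math.round(last.y) != Math.round(c.y)) {
                // Baseline changed: close the current line and start a new one.
                lines.add(line.toString());
                line.setLength(0);
            } else if (last != null && c.x - last.endX > gapTolerance) {
                // Same line, but visibly separated from the previous chunk.
                line.append(' ');
            }
            line.append(c.text);
            last = c;
        }
        if (last != null) {
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // The out-of-order rendering instructions from the example above.
        List<Chunk> chunks = Arrays.asList(
                new Chunk("Ipsum Do", 40f, 700f, 80f),
                new Chunk("Lorem ", 10f, 700f, 40f),
                new Chunk("lor Sit Amet", 80f, 700f, 140f));
        System.out.println(toLines(chunks, 1f)); // [Lorem Ipsum Dolor Sit Amet]
    }
}
```

The y value of each Chunk stays available after sorting, so once a word is found in a line you can report the y coordinate of the chunk it came from.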

Answer 2

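I was able to get this working by adapting my previous version written for Itext5. I don't know if you are looking for C#, but that is what the code below is written in.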

using iText.Kernel.Geom;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Data;
using iText.Kernel.Pdf.Canvas.Parser.Listener;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

class TextLocationStrategy : LocationTextExtractionStrategy
{
    private List<textChunk> objectResult = new List<textChunk>();

    public override void EventOccurred(IEventData data, EventType type)
    {
        if (!type.Equals(EventType.RENDER_TEXT))
            return;

        TextRenderInfo renderInfo = (TextRenderInfo)data;

        string curFont = renderInfo.GetFont().GetFontProgram().ToString();

        float curFontSize = renderInfo.GetFontSize();

        IList<TextRenderInfo> text = renderInfo.GetCharacterRenderInfos();
        foreach (TextRenderInfo t in text)
        {
            string letter = t.GetText();
            Vector letterStart = t.GetBaseline().GetStartPoint();
            Vector letterEnd = t.GetAscentLine().GetEndPoint();
            Rectangle letterRect = new Rectangle(letterStart.Get(0), letterStart.Get(1), letterEnd.Get(0) - letterStart.Get(0), letterEnd.Get(1) - letterStart.Get(1));

            if (letter != " " && !letter.Contains(' '))
            {
                textChunk chunk = new textChunk();
                chunk.text = letter;
                chunk.rect = letterRect;
                chunk.fontFamily = curFont;
                chunk.fontSize = curFontSize;
                chunk.spaceWidth = t.GetSingleSpaceWidth() / 2f;

                objectResult.Add(chunk);
            }
        }
    }
}
public class textChunk
{
    public string text { get; set; }
    public Rectangle rect { get; set; }
    public string fontFamily { get; set; }
    public float fontSize { get; set; } // float, to match TextRenderInfo.GetFontSize()
    public float spaceWidth { get; set; }
}

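I also go character by character, because that works better for my process. You can change the names, and of course the objects, but I created textChunk to hold what I wanted rather than a bunch of renderInfo objects.

You can use this by adding a few lines of code to grab the data from your pdf.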

PdfDocument reader = new PdfDocument(new PdfReader(filepath));
FilteredEventListener listener = new FilteredEventListener();
var strat = listener.AttachEventListener(new TextLocationStrategy());
PdfCanvasProcessor processor = new PdfCanvasProcessor(listener);
processor.ProcessPageContent(reader.GetPage(1));

Once you have done this, you can get at objectResult and do something with it, either by making it public or by adding a method to your class that returns objectResult from strat.

Answer 3

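The other answers explain how to implement a completely new extraction strategy / event listener for the task. Alternatively, one can try to tweak an existing text extraction strategy to do what you require.

This answer demonstrates how to tweak the existing LocationTextExtractionStrategy to return both the text and its characters' respective y coordinates.

Beware, this is only a proof of concept which in particular assumes that the text is written horizontally, i.e. using an effective transformation matrix (ctm and text matrix combined) with b and c equal to 0. Furthermore, the character and coordinate retrieval methods of TextPlusY are not optimized at all and might take long to execute.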
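
To illustrate what the "b and c equal to 0" assumption buys you, here is a small plain-Java sketch. A PDF matrix [a b c d e f] maps a point (x, y) to x' = a*x + c*y + e and y' = b*x + d*y + f. With b = 0 the resulting y' no longer depends on x, so every character along one baseline ends up at the same page y, which is what makes a single y coordinate per chunk meaningful. The class and method names (HorizontalTextCheck, transform, isHorizontal) are made up for this example and are not part of iText.

```java
public class HorizontalTextCheck {

    /** Applies the PDF matrix m = [a b c d e f] to the point (x, y). */
    public static float[] transform(float[] m, float x, float y) {
        float newX = m[0] * x + m[2] * y + m[4];
        float newY = m[1] * x + m[3] * y + m[5];
        return new float[] { newX, newY };
    }

    /** True if b and c are both 0, i.e. no rotation or shear is involved. */
    public static boolean isHorizontal(float[] m) {
        return m[1] == 0f && m[2] == 0f;
    }

    public static void main(String[] args) {
        float[] scaled = { 12f, 0f, 0f, 12f, 72f, 700f };  // scale and translate only
        float[] rotated = { 0f, 1f, -1f, 0f, 72f, 700f };  // rotated by 90 degrees

        // Two points on the same baseline (y = 0) in text space.
        System.out.println(transform(scaled, 0f, 0f)[1] + " " + transform(scaled, 5f, 0f)[1]);
        System.out.println(transform(rotated, 0f, 0f)[1] + " " + transform(rotated, 5f, 0f)[1]);
        System.out.println(isHorizontal(scaled) + " " + isHorizontal(rotated));
    }
}
```

For the scaled matrix both points keep the same y; for the rotated one the y differs from character to character, which is the case this proof of concept does not handle.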

As the OP did not express a language preference, here is a solution for iText7 for Java:

TextPlusY

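For the task at hand, one needs to be able to retrieve characters and y coordinates side by side. To make this easier, I use a class representing both the text and its characters' respective y coordinates. It is derived from CharSequence, a generalization of String, which allows it to be used in many String-related functions: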

import java.util.ArrayList;
import java.util.List;

public class TextPlusY implements CharSequence
{
    final List<String> texts = new ArrayList<>();
    final List<Float> yCoords = new ArrayList<>();

    //
    // CharSequence implementation
    //
    @Override
    public int length()
    {
        int length = 0;
        for (String text : texts)
        {
            length += text.length();
        }
        return length;
    }

    @Override
    public char charAt(int index)
    {
        for (String text : texts)
        {
            if (index < text.length())
            {
                return text.charAt(index);
            }
            index -= text.length();
        }
        throw new IndexOutOfBoundsException();
    }

    @Override
    public CharSequence subSequence(int start, int end)
    {
        TextPlusY result = new TextPlusY();
        int length = end - start;
        for (int i = 0; i < yCoords.size(); i++)
        {
            String text = texts.get(i);
            if (start < text.length())
            {
                float yCoord = yCoords.get(i); 
                if (start > 0)
                {
                    text = text.substring(start);
                    start = 0;
                }
                if (length > text.length())
                {
                    result.add(text, yCoord);
                    // consume this segment so later segments are cut to the remaining length
                    length -= text.length();
                }
                else
                {
                    result.add(text.substring(0, length), yCoord);
                    break;
                }
            }
            else
            {
                start -= text.length();
            }
        }
        return result;
    }

    //
    // Object overrides
    //
    @Override
    public String toString()
    {
        StringBuilder builder = new StringBuilder();
        for (String text : texts)
        {
            builder.append(text);
        }
        return builder.toString();
    }

    //
    // y coordinate support
    //
    public TextPlusY add(String text, float y)
    {
        if (text != null)
        {
            texts.add(text);
            yCoords.add(y);
        }
        return this;
    }

    public float yCoordAt(int index)
    {
        for (int i = 0; i < yCoords.size(); i++)
        {
            String text = texts.get(i);
            if (index < text.length())
            {
                return yCoords.get(i);
            }
            index -= text.length();
        }
        throw new IndexOutOfBoundsException();
    }
}

(TextPlusY.java)

TextPlusYExtractionStrategy

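Now we extend the LocationTextExtractionStrategy to extract a TextPlusY instead of a String. All we need for that is to generalize the method getResultantText.

Unfortunately the LocationTextExtractionStrategy hides some methods and members (private or package protected) which need to be accessed here; thus, some reflection magic is required. If your framework does not allow this, you will have to copy the whole strategy and manipulate it accordingly.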

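import java.lang.reflect.Field;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

import com.itextpdf.kernel.geom.Vector;
import com.itextpdf.kernel.pdf.canvas.parser.listener.LocationTextExtractionStrategy;
import com.itextpdf.kernel.pdf.canvas.parser.listener.TextChunk;

public class TextPlusYExtractionStrategy extends LocationTextExtractionStrategy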
{
    static Field locationalResultField;
    static Method sortWithMarksMethod;
    static Method startsWithSpaceMethod;
    static Method endsWithSpaceMethod;

    static Method textChunkSameLineMethod;

    static
    {
        try
        {
            locationalResultField = LocationTextExtractionStrategy.class.getDeclaredField("locationalResult");
            locationalResultField.setAccessible(true);
            sortWithMarksMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("sortWithMarks", List.class);
            sortWithMarksMethod.setAccessible(true);
            startsWithSpaceMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("startsWithSpace", String.class);
            startsWithSpaceMethod.setAccessible(true);
            endsWithSpaceMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("endsWithSpace", String.class);
            endsWithSpaceMethod.setAccessible(true);

            textChunkSameLineMethod = TextChunk.class.getDeclaredMethod("sameLine", TextChunk.class);
            textChunkSameLineMethod.setAccessible(true);
        }
        catch(NoSuchFieldException | NoSuchMethodException | SecurityException e)
        {
            // Reflection failed
        }
    }

    //
    // constructors
    //
    public TextPlusYExtractionStrategy()
    {
        super();
    }

    public TextPlusYExtractionStrategy(ITextChunkLocationStrategy strat)
    {
        super(strat);
    }

    @Override
    public String getResultantText()
    {
        return getResultantTextPlusY().toString();
    }

    public TextPlusY getResultantTextPlusY()
    {
        try
        {
            List<TextChunk> textChunks = new ArrayList<>((List<TextChunk>)locationalResultField.get(this));
            sortWithMarksMethod.invoke(this, textChunks);

            TextPlusY textPlusY = new TextPlusY();
            TextChunk lastChunk = null;
            for (TextChunk chunk : textChunks)
            {
                float chunkY = chunk.getLocation().getStartLocation().get(Vector.I2);
                if (lastChunk == null)
                {
                    textPlusY.add(chunk.getText(), chunkY);
                }
                else if ((Boolean)textChunkSameLineMethod.invoke(chunk, lastChunk))
                {
                    // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
                    if (isChunkAtWordBoundary(chunk, lastChunk) &&
                            !(Boolean)startsWithSpaceMethod.invoke(this, chunk.getText()) &&
                            !(Boolean)endsWithSpaceMethod.invoke(this, lastChunk.getText()))
                    {
                        textPlusY.add(" ", chunkY);
                    }

                    textPlusY.add(chunk.getText(), chunkY);
                }
                else
                {
                    textPlusY.add("\n", lastChunk.getLocation().getStartLocation().get(Vector.I2));
                    textPlusY.add(chunk.getText(), chunkY);
                }
                lastChunk = chunk;
            }

            return textPlusY;
        }
        catch (IllegalAccessException | IllegalArgumentException | InvocationTargetException e)
        {
            throw new RuntimeException("Reflection failed", e);
        }
    }
}

(TextPlusYExtractionStrategy.java)

Usage

Using these two classes you can extract text with coordinates and search in it like this:

try (   PdfReader reader = new PdfReader(YOUR_PDF);
        PdfDocument document = new PdfDocument(reader)  )
{
    TextPlusYExtractionStrategy extractionStrategy = new TextPlusYExtractionStrategy();
    PdfPage page = document.getFirstPage();

    PdfCanvasProcessor parser = new PdfCanvasProcessor(extractionStrategy);
    parser.processPageContent(page);
    TextPlusY textPlusY = extractionStrategy.getResultantTextPlusY();

    System.out.printf("\nText from test.pdf\n=====\n%s\n=====\n", textPlusY);

    System.out.print("\nText with y from test.pdf\n=====\n");
    
    int length = textPlusY.length();
    float lastY = Float.MIN_NORMAL;
    for (int i = 0; i < length; i++)
    {
        float y = textPlusY.yCoordAt(i);
        if (y != lastY)
        {
            System.out.printf("\n(%4.1f) ", y);
            lastY = y;
        }
        System.out.print(textPlusY.charAt(i));
    }
    System.out.print("\n=====\n");

    System.out.print("\nMatches of 'est' with y from test.pdf\n=====\n");
    Matcher matcher = Pattern.compile("est").matcher(textPlusY);
    while (matcher.find())
    {
        System.out.printf("from character %s to %s at y position (%4.1f)\n", matcher.start(), matcher.end(), textPlusY.yCoordAt(matcher.start()));
    }
    System.out.print("\n=====\n");
}

(ExtractTextPlusY test method testExtractTextPlusYFromTest)

For my test document, the output of the test code above is

Text from test.pdf
=====
Ein Dokumen t mit einigen
T estdaten
T esttest T est test test
=====

Text with y from test.pdf
=====

(691,8) Ein Dokumen t mit einigen

(666,9) T estdaten

(642,0) T esttest T est test test
=====

Matches of 'est' with y from test.pdf
=====
from character 28 to 31 at y position (666,9)
from character 39 to 42 at y position (642,0)
from character 43 to 46 at y position (642,0)
from character 49 to 52 at y position (642,0)
from character 54 to 57 at y position (642,0)
from character 59 to 62 at y position (642,0)

=====

My locale uses the comma as decimal separator; you might see 666.9 instead of 666,9.
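
If you want the same output regardless of the machine's locale, one option is to pass an explicit Locale when formatting. The sketch below is plain Java; YFormatDemo and formatY are names made up for this example, and the format string is the one used in the test code above.

```java
import java.util.Locale;

public class YFormatDemo {

    /** Formats a y coordinate with one decimal place using the given locale. */
    public static String formatY(Locale locale, float y) {
        return String.format(locale, "(%4.1f)", y);
    }

    public static void main(String[] args) {
        System.out.println(formatY(Locale.GERMANY, 666.9f)); // (666,9)
        System.out.println(formatY(Locale.ROOT, 666.9f));    // (666.9)
    }
}
```

System.out.printf has an overload taking a Locale as its first argument, so the same idea applies directly to the printf calls in the usage code.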

The extra spaces you see can be removed by further fine-tuning the base LocationTextExtractionStrategy functionality. But that is the focus of other questions...

Answer 4

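For anyone looking for a simple Rectangle object, this worked for me. I made these two classes; call the static method "GetTextCoordinates" with your page and the desired text.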

public class PdfTextLocator : LocationTextExtractionStrategy
{

    public string TextToSearchFor { get; set; }
    public List<TextChunk> ResultCoordinates { get; set; }

    /// <summary>
    /// Returns a rectangle with a given location of text on a page. Returns null if not found.
    /// </summary>
    /// <param name="page">Page to Search</param>
    /// <param name="s">String to be found</param>
    /// <returns></returns>
    public static Rectangle GetTextCoordinates(PdfPage page, string s) 
    {
        PdfTextLocator strat = new PdfTextLocator(s);
        PdfTextExtractor.GetTextFromPage(page, strat);
        foreach (TextChunk c in strat.ResultCoordinates) 
        {
            if (c.Text == s)
                return c.ResultCoordinates;
        }

        return null;
    }

    public PdfTextLocator(string textToSearchFor)
    {
        this.TextToSearchFor = textToSearchFor;
        ResultCoordinates = new List<TextChunk>();
    }

    public override void EventOccurred(IEventData data, EventType type)
    {
        if (!type.Equals(EventType.RENDER_TEXT))
            return;

        TextRenderInfo renderInfo = (TextRenderInfo)data;
        IList<TextRenderInfo> text = renderInfo.GetCharacterRenderInfos();
        for (int i = 0; i < text.Count; i++) 
        {
            if (text[i].GetText() == TextToSearchFor[0].ToString()) 
            {
                string word = "";
                for (int j = i; j < i + TextToSearchFor.Length && j < text.Count; j++) 
                {
                    word = word + text[j].GetText();
                }
                
                // Note: the rectangle is built from the first matched character only.
                float startX = text[i].GetBaseline().GetStartPoint().Get(0);
                float startY = text[i].GetBaseline().GetStartPoint().Get(1);
                // Width uses the x component (index 0), height uses the y component (index 1).
                ResultCoordinates.Add(new TextChunk(word, new Rectangle(startX, startY, text[i].GetAscentLine().GetEndPoint().Get(0) - startX, text[i].GetAscentLine().GetEndPoint().Get(1) - startY)));
            }
        }
    }

}

public class TextChunk 
{
    public string Text { get; set; }
    public Rectangle ResultCoordinates { get; set; }
    public TextChunk(string s, Rectangle r) 
    {
        Text = s;
        ResultCoordinates = r;
    }
}
