使用 iText 替换 PDF 文件中的文本

Replace text inside a PDF file using iText

我正在使用 iText(5.5.13) 库读取 .PDF 并替换文件中的模式。问题是没有找到模式,因为当库读取 pdf 时不知何故出现了一些奇怪的字符。

例如,在句子中:

"This is a test in order to see if the"

当我试图阅读它时变成了这个:

[(This is a )9(te)-3(st)9( in o)-4(rd)15(er )-2(t)9(o)-5( s)8(ee)7( if t)-3(h)3(e )]

因此,如果我尝试查找并替换 "test",则不会在 pdf 中找到 "test" 字词,并且不会被替换

这是我正在使用的代码:

public void processPDF(String src, String dest) {

    try {

      PdfReader reader = new PdfReader(src);
      PdfArray refs = null;
      PRIndirectReference reference = null;

      int nPages = reader.getNumberOfPages();

      for (int i = 1; i <= nPages; i++) {
        PdfDictionary dict = reader.getPageN(i);
        PdfObject object = dict.getDirectObject(PdfName.CONTENTS);
        if (object.isArray()) {
          refs = dict.getAsArray(PdfName.CONTENTS);
          ArrayList<PdfObject> references = refs.getArrayList();

          for (PdfObject r : references) {

            reference = (PRIndirectReference) r;
            PRStream stream = (PRStream) PdfReader.getPdfObject(reference);
            byte[] data = PdfReader.getStreamBytes(stream);
            String dd = new String(data, "UTF-8");

            dd = dd.replaceAll("@pattern_1234", "trueValue");
            dd = dd.replaceAll("test", "tested");

            stream.setData(dd.getBytes());
          }

        }
        if (object instanceof PRStream) {
          PRStream stream = (PRStream) object;

          byte[] data = PdfReader.getStreamBytes(stream);
          String dd = new String(data, "UTF-8");
          System.out.println("content---->" + dd);
          dd = dd.replaceAll("@pattern_1234", "trueValue");
          dd = dd.replaceAll("This", "FIRST");

          stream.setData(dd.getBytes(StandardCharsets.UTF_8));
        }
      }
      PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
      stamper.close();
      reader.close();
    }

    catch (Exception e) {
    }
  }

PDF 文件不是文字处理文件。您看到的是字符的显式放置,这些字符被紧排在一起 and/or 许多其他东西。你梦想 "replace" 以这种方式发短信是不可能的,或者说得更好,即使不是不可能,也不太可能。

PDF 是具有字节偏移量的二进制文件。它有很多部分。就像这是在这个字节偏移量处读取这个,然后去那个字节偏移量并读取那个。

您不能只是将 "foo" 替换为 "foobar" 并认为它会起作用。它会破坏所有字节偏移并完全破坏文件。

在询问之前自己尝试一下。

在你上面的例子中,在一些编辑器中打开文件并更改你发布的字符串:

This is a

对此:

WOW Let me change this data around for the content "This is a"

保存该文件并尝试打开它。即便如此,这是一组不跨越您确定的边界的内容也不会起作用。因为它不是文字处理文件。它不是文本文件。它是一个二进制文件,您无法像您想象的那样进行操作。

我找到了这个博客Replacing PDF objects(legacy),大家都说二进制文件很难改。试试这个是否有效。

    PdfReader reader = new PdfReader(src);
    PdfDictionary dict = reader.getPageN(1);
    PdfObject object = dict.getDirectObject(PdfName.CONTENTS);
    if (object instanceof PRStream) {
        PRStream stream = (PRStream)object;
        byte[] data = PdfReader.getStreamBytes(stream);
        stream.setData(new String(data).replace("test", "tested").getBytes());
    }
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
    stamper.close();
    reader.close();

对于 iText 版本 7 检查 Replacing PDF objects

正如评论和答案中已经提到的,PDF 不是一种用于文本编辑 的格式。它是最终格式,有关文本流、布局甚至到 Unicode 的映射的信息都是可选的。

因此,即使假设存在有关将字形映射到 Unicode 的可选信息,使用 iText 完成此任务的方法可能看起来有点不令人满意:第一个方法是使用自定义文本提取策略来确定相关文本的位置,然后继续使用 PdfCleanUpProcessor 删除该位置所有内容的当前内容,最后将替换文本绘制到间隙中。

在这个答案中,我将提供一个助手 class 允许结合前两个步骤,查找和删除现有文本,其优点是确实只有 text 被删除, 不是 任何 背景图形等 就像 PdfCleanUpProcessor 编辑的情况一样。助手还 returns 删除文本的位置,允许在其上标记替换。

助手 class 是基于 this earlier answer. Please use the version of this class on github 中介绍的 PdfContentStreamEditor,不过,由于最初的 class 从构思开始就得到了一些增强。

SimpleTextRemover 助手 class 说明了从 PDF 中正确删除文本的必要条件。其实限制在几个方面:

  • 它只替换实际页面内容流中的文本。

    要同时替换嵌入式 XObject 中的文本,必须以递归方式遍历相关页面的 XObject 资源,并对它们应用编辑器。

  • 它与 SimpleTextExtractionStrategy 一样“简单”:它假定显示说明的文本按阅读顺序出现在内容中。

    同时处理顺序不同且指令必须排序的内容流,这意味着所有传入指令和相关呈现信息必须缓存到页面末尾,而不仅仅是页面末尾的一些指令一个时间。然后可以对渲染信息进行排序,可以在排序后的渲染信息中标识要删除的部分,可以操作关联的指令,最终可以存储指令。

  • 它不会尝试识别在视觉上代表白色的字形之间的间隙 space 而实际上根本没有字形。

    要识别间隙,必须扩展代码以检查两个连续的字形是否完全相继出现,或者是否存在间隙或跳行。

  • 在计算删除字形后留下的间隙时,还没有考虑字符和单词的间距。

    要改进这一点,必须改进字形宽度计算。

不过,考虑到您的内容流中的示例摘录,这些限制可能不会妨碍您。

public class SimpleTextRemover extends PdfContentStreamEditor {
    public SimpleTextRemover() {
        super (new SimpleTextRemoverListener());
        ((SimpleTextRemoverListener)getRenderListener()).simpleTextRemover = this;
    }

    /**
     * <p>Removes the string to remove from the given page of the
     * document in the PDF reader the given PDF stamper works on.</p>
     * <p>The result is a list of glyph lists each of which represents
     * a match can can be queried for position information.</p>
     */
    public List<List<Glyph>> remove(PdfStamper pdfStamper, int pageNum, String toRemove) throws IOException {
        if (toRemove.length()  == 0)
            return Collections.emptyList();

        this.toRemove = toRemove;
        cachedOperations.clear();
        elementNumber = -1;
        pendingMatch.clear();
        matches.clear();
        allMatches.clear();
        editPage(pdfStamper, pageNum);
        return allMatches;
    }

    /**
     * Adds the given operation to the cached operations and checks
     * whether some cached operations can meanwhile be processed and
     * written to the result content stream.
     */
    @Override
    protected void write(PdfContentStreamProcessor processor, PdfLiteral operator, List<PdfObject> operands) throws IOException {
        cachedOperations.add(new ArrayList<>(operands));

        while (process(processor)) {
            cachedOperations.remove(0);
        }
    }

    /**
     * Removes any started match and sends all remaining cached
     * operations for processing.
     */
    @Override
    public void finalizeContent() {
        pendingMatch.clear();
        try {
            while (!cachedOperations.isEmpty()) {
                if (!process(this)) {
                    // TODO: Should not happen, so warn
                    System.err.printf("Failure flushing operation %s; dropping.\n", cachedOperations.get(0));
                }
                cachedOperations.remove(0);
            }
        } catch (IOException e) {
            throw new ExceptionConverter(e);
        }
    }

    /**
     * Tries to process the first cached operation. Returns whether
     * it could be processed.
     */
    boolean process(PdfContentStreamProcessor processor) throws IOException {
        if (cachedOperations.isEmpty())
            return false;

        List<PdfObject> operands = cachedOperations.get(0);
        PdfLiteral operator = (PdfLiteral) operands.get(operands.size() - 1);
        String operatorString = operator.toString();

        if (TEXT_SHOWING_OPERATORS.contains(operatorString))
            return processTextShowingOp(processor, operator, operands);

        super.write(processor, operator, operands);
        return true;
    }

    /**
     * Tries to processes a text showing operation. Unless a match
     * is pending and starts before the end of the argument of this
     * instruction, it can be processed. If the instructions contains
     * a part of a match, it is transformed to a TJ operation and
     * the glyphs in question are replaced by text position adjustments.
     * If the original operation had a side effect (jump to next line
     * or spacing adjustment), this side effect is explicitly added.
     */
    boolean processTextShowingOp(PdfContentStreamProcessor processor, PdfLiteral operator, List<PdfObject> operands) throws IOException {
        PdfObject object = operands.get(operands.size() - 2);
        boolean isArray = object instanceof PdfArray;
        PdfArray array = isArray ? (PdfArray) object : new PdfArray(object);
        int elementCount = countStrings(object);

        // Currently pending glyph intersects parameter of this operation -> cannot yet process
        if (!pendingMatch.isEmpty() && pendingMatch.get(0).elementNumber < processedElements + elementCount)
            return false;

        // The parameter of this operation is subject to a match -> copy as is
        if (matches.size() == 0 || processedElements + elementCount <= matches.get(0).get(0).elementNumber || elementCount == 0) {
            super.write(processor, operator, operands);
            processedElements += elementCount;
            return true;
        }

        // The parameter of this operation contains glyphs of a match -> manipulate 
        PdfArray newArray = new PdfArray();
        for (int arrayIndex = 0; arrayIndex < array.size(); arrayIndex++) {
            PdfObject entry = array.getPdfObject(arrayIndex);
            if (!(entry instanceof PdfString)) {
                newArray.add(entry);
            } else {
                PdfString entryString = (PdfString) entry;
                byte[] entryBytes = entryString.getBytes();
                for (int index = 0; index < entryBytes.length; ) {
                    List<Glyph> match = matches.size() == 0 ? null : matches.get(0);
                    Glyph glyph = match == null ? null : match.get(0);
                    if (glyph == null || processedElements < glyph.elementNumber) {
                        newArray.add(new PdfString(Arrays.copyOfRange(entryBytes, index, entryBytes.length)));
                        break;
                    }
                    if (index < glyph.index) {
                        newArray.add(new PdfString(Arrays.copyOfRange(entryBytes, index, glyph.index)));
                        index = glyph.index;
                        continue;
                    }
                    newArray.add(new PdfNumber(-glyph.width));
                    index++;
                    match.remove(0);
                    if (match.isEmpty())
                        matches.remove(0);
                }
                processedElements++;
            }
        }
        writeSideEffect(processor, operator, operands);
        writeTJ(processor, newArray);

        return true;
    }

    /**
     * Counts the strings in the given argument, itself a string or
     * an array containing strings and non-strings.
     */
    int countStrings(PdfObject textArgument) {
        if (textArgument instanceof PdfArray) {
            int result = 0;
            for (PdfObject object : (PdfArray)textArgument) {
                if (object instanceof PdfString)
                    result++;
            }
            return result;
        } else 
            return textArgument instanceof PdfString ? 1 : 0;
    }

    /**
     * Writes side effects of a text showing operation which is going to be
     * replaced by a TJ operation. Side effects are line jumps and changes
     * of character or word spacing.
     */
    void writeSideEffect(PdfContentStreamProcessor processor, PdfLiteral operator, List<PdfObject> operands) throws IOException {
        switch (operator.toString()) {
        case "\"":
            super.write(processor, OPERATOR_Tw, Arrays.asList(operands.get(0), OPERATOR_Tw));
            super.write(processor, OPERATOR_Tc, Arrays.asList(operands.get(1), OPERATOR_Tc));
        case "'":
            super.write(processor, OPERATOR_Tasterisk, Collections.singletonList(OPERATOR_Tasterisk));
        }
    }

    /**
     * Writes a TJ operation with the given array unless array is empty.
     */
    void writeTJ(PdfContentStreamProcessor processor, PdfArray array) throws IOException {
        if (!array.isEmpty()) {
            List<PdfObject> operands = Arrays.asList(array, OPERATOR_TJ);
            super.write(processor, OPERATOR_TJ, operands);
        }
    }

    /**
     * Analyzes the given text render info whether it starts a new match or
     * finishes / continues / breaks a pending match. This method is called
     * by the {@link SimpleTextRemoverListener} registered as render listener
     * of the underlying content stream processor.
     */
    void renderText(TextRenderInfo renderInfo) {
        elementNumber++;
        int index = 0;
        for (TextRenderInfo info : renderInfo.getCharacterRenderInfos()) {
            int matchPosition = pendingMatch.size();
            pendingMatch.add(new Glyph(info, elementNumber, index));
            if (!toRemove.substring(matchPosition, matchPosition + info.getText().length()).equals(info.getText())) {
                reduceToPartialMatch();
            }
            if (pendingMatch.size() == toRemove.length()) {
                matches.add(new ArrayList<>(pendingMatch));
                allMatches.add(new ArrayList<>(pendingMatch));
                pendingMatch.clear();
            }
            index++;
        }
    }

    /**
     * Reduces the current pending match to an actual (partial) match
     * after the addition of the next glyph has invalidated it as a
     * whole match.
     */
    void reduceToPartialMatch() {
        outer:
        while (!pendingMatch.isEmpty()) {
            pendingMatch.remove(0);
            int index = 0;
            for (Glyph glyph : pendingMatch) {
                if (!toRemove.substring(index, index + glyph.text.length()).equals(glyph.text)) {
                    continue outer;
                }
                index++;
            }
            break;
        }
    }

    String toRemove = null;
    final List<List<PdfObject>> cachedOperations = new LinkedList<>();

    int elementNumber = -1;
    int processedElements = 0;
    final List<Glyph> pendingMatch = new ArrayList<>();
    final List<List<Glyph>> matches = new ArrayList<>();
    final List<List<Glyph>> allMatches = new ArrayList<>();

    /**
     * Render listener class used by {@link SimpleTextRemover} as listener
     * of its content stream processor ancestor. Essentially it forwards
     * {@link TextRenderInfo} events and ignores all else.
     */
    static class SimpleTextRemoverListener implements RenderListener {
        @Override
        public void beginTextBlock() { }

        @Override
        public void renderText(TextRenderInfo renderInfo) {
            simpleTextRemover.renderText(renderInfo);
        }

        @Override
        public void endTextBlock() { }

        @Override
        public void renderImage(ImageRenderInfo renderInfo) { }

        SimpleTextRemover simpleTextRemover = null;
    }

    /**
     * Value class representing a glyph with information on
     * the displayed text and its position, the overall number
     * of the string argument of a text showing instruction
     * it is in and the index at which it can be found therein,
     * and the width to use as text position adjustment when
     * replacing it. Beware, the width does not yet consider
     * character and word spacing!
     */
    public static class Glyph {
        public Glyph(TextRenderInfo info, int elementNumber, int index) {
            text = info.getText();
            ascent = info.getAscentLine();
            base = info.getBaseline();
            descent = info.getDescentLine();
            this.elementNumber = elementNumber;
            this.index = index;
            this.width = info.getFont().getWidth(text);
        }

        public final String text;
        public final LineSegment ascent;
        public final LineSegment base;
        public final LineSegment descent;
        final int elementNumber;
        final int index;
        final float width;
    }

    final PdfLiteral OPERATOR_Tasterisk = new PdfLiteral("T*");
    final PdfLiteral OPERATOR_Tc = new PdfLiteral("Tc");
    final PdfLiteral OPERATOR_Tw = new PdfLiteral("Tw");
    final PdfLiteral OPERATOR_Tj = new PdfLiteral("Tj");
    final PdfLiteral OPERATOR_TJ = new PdfLiteral("TJ");
    final static List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
    final static Glyph[] EMPTY_GLYPH_ARRAY = new Glyph[0];
}

(SimpleTextRemover帮手class)

你可以这样使用它:

PdfReader pdfReader = new PdfReader(SOURCE);
PdfStamper pdfStamper = new PdfStamper(pdfReader, RESULT_STREAM);
SimpleTextRemover remover = new SimpleTextRemover();

System.out.printf("\ntest.pdf - Test\n");
for (int i = 1; i <= pdfReader.getNumberOfPages(); i++)
{
    System.out.printf("Page %d:\n", i);
    List<List<Glyph>> matches = remover.remove(pdfStamper, i, "Test");
    for (List<Glyph> match : matches) {
        Glyph first = match.get(0);
        Vector baseStart = first.base.getStartPoint();
        Glyph last = match.get(match.size()-1);
        Vector baseEnd = last.base.getEndPoint();
        System.out.printf("  Match from (%3.1f %3.1f) to (%3.1f %3.1f)\n", baseStart.get(I1), baseStart.get(I2), baseEnd.get(I1), baseEnd.get(I2));
    }
}

pdfStamper.close();

(RemovePageTextContent 测试 testRemoveTestFromTest)

我的测试文件的控制台输出如下:

test.pdf - Test
Page 1:
  Match from (134,8 666,9) to (177,8 666,9)
  Match from (134,8 642,0) to (153,4 642,0)
  Match from (172,8 642,0) to (191,4 642,0)

以及输出 PDF 中这些位置缺少“测试”的次数。

您可以使用它们在相关位置绘制替换文本,而不是输出匹配坐标。