Error while reading JSON from a PDF file using iText

Question

I have been trying to read JSON from a PDF file. I can write the JSON string to the PDF, but when I read the PDF back, I get the following error.

Caused by: com.google.gson.stream.MalformedJsonException: Unterminated object at line 60 column 3 path $.All_Routes[0].route_data

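I printed the JSON before writing it to the file and checked it with an online JSON validator; it is valid JSON. After I write it to the PDF, however, it becomes invalid: I copied the JSON out of the PDF and validated it online, and validation failed with an error.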

Here is the code that writes the JSON to the PDF file.

try {
    File file = AppUtils.createFile(".pdf");
    Document document = new Document();
    document.setPageSize(PageSize.A4);
    document.addCreationDate();
    document.addAuthor("Me");
    PdfWriter.getInstance(document, new FileOutputStream(file));
    document.open();

    String jsonBody = new Gson().toJson(backUpModel);

    Gson gson = new GsonBuilder().setPrettyPrinting().create();
    JsonParser parser = new JsonParser();
    JsonElement jsonElement = parser.parse(jsonBody);
    String prettyJsonBody = gson.toJson(jsonElement);

    Log.i(Constants.TAG, "Input Json: " + prettyJsonBody);
    document.add(new Paragraph(prettyJsonBody));
    document.close();

    //Toast.makeText(BackUp.this, "Saved Succesfully", Toast.LENGTH_SHORT).show();
} catch (Exception e) {
    e.printStackTrace();
}

Here is the code that reads the PDF file.

try {
    File exportDir = new File(Environment.getExternalStorageDirectory(), Constants.TAG);
    String filePath = exportDir.getPath() + File.separator + getFileName(fileUri);
    PdfReader pdfReader = new PdfReader(filePath);
    int numberOfPages = pdfReader.getNumberOfPages();
    StringBuilder stringBuilder = new StringBuilder();
    for (int i = 1; i <= numberOfPages; i++) {
        stringBuilder.append(PdfTextExtractor.getTextFromPage(pdfReader, i));
    }
    pdfReader.close();
    String jsonBody = stringBuilder.toString();
    BackUpModel backUpModel = new Gson().fromJson(jsonBody, BackUpModel.class);
} catch (IOException e) {
    e.printStackTrace();
}

Can anyone suggest a possible solution to this problem?

Thanks.

Answer 1

Comparing the input JSON with the output, it becomes clear that you cannot faithfully extract the JSON from the PDF your current code generates.
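
To make that comparison concrete, a small helper can report where the written string and the extracted string first diverge. This is only a sketch: `JsonDiff` and `firstDifference` are hypothetical names, not part of iText or Gson, and the two sample strings are made up for illustration.

```java
final class JsonDiff {
    /** Returns the index of the first differing character, or -1 if both strings are equal. */
    static int firstDifference(String expected, String actual) {
        int common = Math.min(expected.length(), actual.length());
        for (int i = 0; i < common; i++) {
            if (expected.charAt(i) != actual.charAt(i)) {
                return i;
            }
        }
        // Either both strings are equal, or one is a prefix of the other.
        return expected.length() == actual.length() ? -1 : common;
    }

    public static void main(String[] args) {
        // Made-up sample data: the space inside the value was replaced by a line break.
        String written = "{\"name\": \"Main Street route\"}";
        String extracted = "{\"name\": \"Main Street\nroute\"}";
        System.out.println("First difference at index " + firstDifference(written, extracted));
    }
}
```

Logging the characters around the reported index for the string passed to `Paragraph` and the string returned by `PdfTextExtractor` should show whether a line break or a missing space is what breaks the parser.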

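The problem arises because, when a string is rendered to PDF, line breaks are added to keep the text from running into the page margin. Each line break in the result may already have been in the input string or may have been introduced by iText, and in general you cannot tell which is the case.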

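If iText broke lines only at whitespace or punctuation (colons, commas, brackets) outside of names and values, these extra line breaks would not change the meaning of the JSON object. Line breaks inside names and values are a different matter.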

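Even if we could assume that names and values contain no line breaks of their own (the JSON you shared actually does have line breaks in values, but those may have crept in through the way you shared it), so that we could simply drop every line break, a problem remains: some of those line breaks were applied where the original value had a space, and others where it had none. Where a line was broken at a space, that space is dropped and no longer appears in the final output. Again, in general you cannot tell which is the case from the extracted output alone.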

Thus, faithful extraction is impossible.
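
The ambiguity can be reproduced without any PDF library. The sketch below uses a deliberately naive greedy wrapper as a stand-in for a layout engine; it is not iText's actual line-breaking algorithm, and `WrapAmbiguity`, `wrap`, and the sample values are assumptions made for illustration. Two different inputs produce the same wrapped text, so no extractor could tell them apart.

```java
final class WrapAmbiguity {
    /** Naive greedy wrap: break at the last space that fits (dropping it), else break hard. */
    static String wrap(String text, int width) {
        StringBuilder out = new StringBuilder();
        int start = 0;
        while (text.length() - start > width) {
            // Last space that ends a line of at most `width` characters.
            int breakAt = text.lastIndexOf(' ', start + width);
            if (breakAt > start) {
                out.append(text, start, breakAt).append('\n');
                start = breakAt + 1; // the space at the break is dropped
            } else {
                out.append(text, start, start + width).append('\n');
                start += width; // hard break inside a word, nothing dropped
            }
        }
        return out.append(text.substring(start)).toString();
    }

    public static void main(String[] args) {
        String withSpace = "NorthBridge Station";
        String withoutSpace = "NorthBridgeStation";
        // Both inputs wrap to the same two lines at width 11.
        System.out.println(wrap(withSpace, 11).equals(wrap(withoutSpace, 11)));
    }
}
```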

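Therefore, you have to change the way you embed the JSON in the PDF. Since you did not mention why you are doing this at all or which alternatives you have, I cannot give a final recommendation, only some options that may or may not be compatible with your requirements: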

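1. Embed the JSON not as regular static page content but as the value of a multi-line form text field. Values in form fields can be extracted from the PDF faithfully.
2. In addition to the JSON visible in the page content, embed the JSON in a private stream object in the PDF; you can then extract the JSON faithfully from that stream object.
3. Use a font so small that iText does not add line breaks during rendering. (The result is likely to be too small to read without zooming in, though.)
4. Render the JSON manually (using the low-level iText API) and somehow mark the line breaks you add and the spaces you remove. During extraction you have to react to those markers.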
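
The marker idea behind the last option can be sketched on plain strings, independent of how the lines are later drawn with iText. Everything here is an assumption made for illustration: the class `MarkedWrap`, the fixed column width, and the pilcrow as marker, which is assumed never to occur in the JSON itself. The sketch breaks at a fixed column instead of at spaces, so no space is ever removed and only the added line breaks need marking.

```java
final class MarkedWrap {
    // Hypothetical marker; assumed never to occur in the JSON itself.
    static final char SOFT_BREAK = '\u00B6';

    /** Inserts a marked line break whenever a line reaches `width` characters. */
    static String wrap(String json, int width) {
        if (json.indexOf(SOFT_BREAK) >= 0) {
            throw new IllegalArgumentException("input already contains the marker");
        }
        StringBuilder out = new StringBuilder();
        int column = 0;
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (c == '\n') { // original line break: keep it unmarked
                out.append(c);
                column = 0;
                continue;
            }
            if (column == width) { // line is full: add a marked break
                out.append(SOFT_BREAK).append('\n');
                column = 0;
            }
            out.append(c);
            column++;
        }
        return out.toString();
    }

    /** Removes exactly the breaks that wrap() added. */
    static String unwrap(String extracted) {
        return extracted.replace(SOFT_BREAK + "\n", "");
    }

    public static void main(String[] args) {
        String json = "{\"name\": \"North Bridge Station\",\n \"stops\": 12}";
        System.out.println(unwrap(wrap(json, 10)).equals(json));
    }
}
```

Whether this survives a real round trip depends on the text extractor returning every rendered line unchanged, including leading spaces and the marker glyph, which would have to be verified against the generated PDF.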

For example, to implement option 1, embedding the JSON as the value of a multi-line form text field, simply add it like this:

Document document = new Document();
document.setPageSize(PageSize.A4);
document.addCreationDate();
document.addAuthor("Me");
PdfWriter pdfWriter = PdfWriter.getInstance(document, new FileOutputStream(jsonPdfFile));
document.open();
pdfWriter.getAcroForm().setNeedAppearances(true);
TextField textField = new TextField(pdfWriter, document.getPageSize(), "json");
textField.setOptions(TextField.MULTILINE | TextField.READ_ONLY);
PdfFormField field = textField.getTextField();
field.setValueAsString(originalJson);
pdfWriter.addAnnotation(field);
document.close();

And then extract it again like this:

PdfReader pdfReader = new PdfReader(jsonPdfFile.getAbsolutePath());
String jsonBody = pdfReader.getAcroFields().getField("json");
pdfReader.close();

(ExtractJson test testJsonToPdfToJsonFormField)

I am using the current iText 5.5.14-SNAPSHOT development branch. The code should work with any 5.5.x version, though.
