Getting wrong Arabic translation in PDF with iText

Question

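I am generating a PDF file from an HTML string, but the content of the generated PDF does not match the HTML: the PDF contains what looks like random content. I read about the issue on Google, where the suggestion was to use Unicode notation like %u0627%u0646%u0627%20%u0627%u0633%u0645%u0649%20%u0639%u0628%u062F%u0627%u0644%u0644%u0647. But when I put that into my HTML, it is printed as-is.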

Related question:

package com.example.demo;

import com.itextpdf.html2pdf.ConverterProperties;
import com.itextpdf.html2pdf.HtmlConverter;
import com.itextpdf.styledxmlparser.css.media.MediaDeviceDescription;
import com.itextpdf.styledxmlparser.css.media.MediaType;
import com.itextpdf.html2pdf.resolver.font.DefaultFontProvider;
import com.itextpdf.layout.font.FontProvider;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

@SpringBootApplication
public class DemoApplication {

    public static void main(String[] args) throws IOException {
        SpringApplication.run(DemoApplication.class, args);
        String htmlSource = getContent();
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        ConverterProperties converterProperties = new ConverterProperties();
        FontProvider dfp = new DefaultFontProvider(true, false, false);
        dfp.addFont("/Library/Fonts/Arial.ttf");
        converterProperties.setFontProvider(dfp);
        converterProperties.setMediaDeviceDescription(new MediaDeviceDescription(MediaType.PRINT));
        HtmlConverter.convertToPdf(htmlSource, outputStream, converterProperties);
        byte[] bytes = outputStream.toByteArray();
        File pdfFile = new File("java19.pdf");
        FileOutputStream fos = new FileOutputStream(pdfFile);
        fos.write(bytes);
        fos.flush();
        fos.close();
    }

    private static String getContent() {
        return "<!DOCTYPE html>\n" +
                "<html lang=\"en\">\n" +
                "\n" +
                "<head>\n" +
                "    <meta charset=\"UTF-8\">\n" +
                "    <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\">\n" +
                "    <meta http-equiv=\"X-UA-Compatible\" content=\"ie=edge\">\n" +
                "    <title>Document</title>\n" +
                "    <style>\n" +
                "      @page {\n" +
                "        margin: 0;\n" +
                "        font-family: arial;\n" +
                "      }\n" +
                "    </style>\n" +
                "</head>\n" +
                "\n" +
                "<body\n" +
                "    style=\"margin: 0;padding: 0;font-family: arial, sans-serif;font-size: 14px;line-height: 125%;width: 100%;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #222222;\">\n" +
                "    <table cellpadding=\"0\" cellspacing=\"0\" width=\"100%\" style=\"background: white; direction: rtl;\">\n" +
                "        <tbody>\n" +
                "            <tr>\n" +
                "                <td style=\"padding: 0 35px;\">\n" +
                "                    <p> انا اسمى عبدالله\n" +
                "                    </p>\n" +
                "                </td>\n" +
                "            </tr>\n" +
                "        </tbody>\n" +
                "    </table>\n" +
                "\n" +
                "</body>\n" +
                "\n" +
                "</html>";
    }
}

Answer 1

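Make sure your font supports the characters you need. If you use the Maven resources directory to include an additional font during the build, check that the font file is not filtered (property replacement), because filtering corrupts the file.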

Answer 2

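Check that your source file and the compiler use the same encoding, for example UTF-8. I sometimes verify this by including characters that exist only in Unicode and not in the other classic code pages.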
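One way to make that check concrete is to print the code points the compiler actually put into a string literal. The sketch below is only an illustration: the class name `CodePointDump` and the method `describe` are made up for this example, and the sample string is written with escapes so the snippet itself compiles under any source encoding.

```java
import java.util.stream.Collectors;

public class CodePointDump {

    // Lists every code point of the text in U+XXXX form, separated by spaces.
    static String describe(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Placeholder sample (ALEF, NOON, ALEF). When testing your own build,
        // replace it with the raw Arabic text typed directly into the source file.
        String sample = "\u0627\u0646\u0627";
        System.out.println(describe(sample)); // prints "U+0627 U+0646 U+0627"
    }
}
```

If a raw Arabic literal in a UTF-8 file prints values such as U+00D8 U+00A7 instead of U+0627, the compiler read the file as ISO-8859-1. With javac the source encoding can be set explicitly with `-encoding UTF-8`.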

I tried to reproduce the issue, and when running the example code I got the following warning in the log:

Cannot find pdfCalligraph module, which was implicitly required by one of the layout properties

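Alexsey Subach already mentioned this. It can cause the following problems:

Text direction problems (I am not an Arabic expert, but the text is right-aligned)
Incorrect character composition (for details see this document: https://itextpdf.com/sites/default/files/2018-12/iText_pdfCalligraph_4pager.pdf)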

This is the output I get without pdfCalligraph:

[Image: pdf result without calligraph]

Created with the code base at this repository.

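So, to make everything work as well for Arabic as HTML does in your browser, you also need:

pdfCalligraph: https://itextpdf.com/en/products/itext-7/pdfcalligraph
Code that loads the license file (otherwise you will get a LicenseFileNotLoadedException)
This dependency: https://repo.itextsupport.com/releases/com/itextpdf/typography/2.0.6/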

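Your question is tagged iText7, but depending on your requirements there may be other possible free alternatives, like Apache FOP, which should work with Arabic ligatures according to this source. It would probably require rework, though, because it is based on XSL-FO. In theory you could generate the XSL-FO with whatever templating mechanism you currently use (JSP, JSF, Thymeleaf, etc.) and convert the XSL-FO to PDF on the fly during the request with something like a ServletFilter (in a web application).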

Answer 3

Without seeing the incorrect output, it is hard to tell exactly what the problem is. But your "random content" sounds like an encoding issue.

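Because your source code contains the Arabic content directly, you have to be careful with the encoding. For example, when the source is handled as ISO-8859-1, the Arabic text in the resulting PDF comes out wrong.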
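The effect can be reproduced without iText. The following sketch simulates a UTF-8 source file being read as ISO-8859-1; `MojibakeDemo` and `misread` are hypothetical names used only for this illustration.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Encodes the text as UTF-8 and decodes those bytes as ISO-8859-1,
    // which is what happens when a UTF-8 source file is read with the wrong charset.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String alef = "\u0627"; // ARABIC LETTER ALEF, bytes D8 A7 in UTF-8
        String garbled = misread(alef);

        System.out.println(garbled.length());               // prints 2
        System.out.println(garbled.equals("\u00D8\u00A7")); // prints true
    }
}
```

Each Arabic letter turns into two unrelated Latin-1 characters, which would explain content that looks random in the PDF.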

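With Unicode escape sequences (\uXXXX) you can indeed avoid some of these encoding problems. Replacing

"                    <p> انا اسمى عبدالله\n" +

with

"                    <p>\u0627\u0646\u0627 \u0627\u0633\u0645\u0649 \u0639\u0628\u062F\u0627\u0644\u0644\u0647\n" +

produces Arabic glyphs even with ISO-8859-1 encoding. Alternatively, you can use UTF-8 to get the right content, with or without Unicode escape sequences.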
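The %uXXXX notation tried in the question is the format produced by JavaScript's legacy escape() function. Neither HTML nor Java interprets it, which is why it is printed as-is. As a rough sketch, such a string can be decoded into an ordinary Java string before it is placed in the HTML. `PercentUDecoder` and `decode` are hypothetical names, and malformed input is assumed not to occur (it would raise a NumberFormatException).

```java
public class PercentUDecoder {

    // Decodes "%uXXXX" (one UTF-16 code unit) and "%XX" (one Latin-1 character);
    // every other character is copied unchanged.
    static String decode(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '%' && i + 6 <= input.length() && input.charAt(i + 1) == 'u') {
                // Four hex digits follow the "%u" prefix.
                out.append((char) Integer.parseInt(input.substring(i + 2, i + 6), 16));
                i += 6;
            } else if (c == '%' && i + 3 <= input.length()) {
                // Two hex digits follow the "%" prefix.
                out.append((char) Integer.parseInt(input.substring(i + 1, i + 3), 16));
                i += 3;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String escaped = "%u0627%u0646%u0627%20%u0627%u0633%u0645%u0649"
                + "%20%u0639%u0628%u062F%u0627%u0644%u0644%u0647";
        String expected = "\u0627\u0646\u0627 \u0627\u0633\u0645\u0649 "
                + "\u0639\u0628\u062F\u0627\u0644\u0644\u0647";
        System.out.println(decode(escaped).equals(expected)); // prints true
    }
}
```

Inside the HTML itself, the equivalent of such an escape is a numeric character reference such as &#x0627;.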

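With the encoding issues resolved, you may still get output in which the Arabic text is not rendered correctly.

To render certain writing systems correctly, iText 7 needs an optional module, pdfCalligraph. With this module enabled, the text is rendered correctly.

The code used for the tests above: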

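import com.itextpdf.html2pdf.ConverterProperties;
import com.itextpdf.html2pdf.HtmlConverter;
import com.itextpdf.html2pdf.resolver.font.DefaultFontProvider;
import com.itextpdf.layout.font.FontProvider;
// Assumption: LicenseKey comes from the itext-licensekey artifact
import com.itextpdf.licensekey.LicenseKey;
import com.itextpdf.styledxmlparser.css.media.MediaDeviceDescription;
import com.itextpdf.styledxmlparser.css.media.MediaType;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public static void main(String[] args) throws IOException {
    // Needed for pdfCalligraph: load the license before converting
    LicenseKey.loadLicenseFile("all-products.xml");

    String htmlSource = getContent();
    ConverterProperties converterProperties = new ConverterProperties();
    // Standard fonts only (no shipped or system fonts), plus Arial for the Arabic glyphs
    FontProvider dfp = new DefaultFontProvider(true, false, false);
    dfp.addFont("/Library/Fonts/Arial.ttf");
    converterProperties.setFontProvider(dfp);
    converterProperties.setMediaDeviceDescription(new MediaDeviceDescription(MediaType.PRINT));

    File pdfFile = new File("java19.pdf");
    // try-with-resources closes the stream even if the conversion throws
    try (OutputStream outputStream = new FileOutputStream(pdfFile)) {
        HtmlConverter.convertToPdf(htmlSource, outputStream, converterProperties);
    }
}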

private static String getContent() {
    return "<!DOCTYPE html>\n" +
            "<html lang=\"en\">\n" +
            "\n" +
            "<head>\n" +
            "    <meta charset=\"UTF-8\">\n" +
            "    <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\">\n" +
            "    <meta http-equiv=\"X-UA-Compatible\" content=\"ie=edge\">\n" +
            "    <title>Document</title>\n" +
            "    <style>\n" +
            "      @page {\n" +
            "        margin: 0;\n" +
            "        font-family: arial;\n" +
            "      }\n" +
            "    </style>\n" +
            "</head>\n" +
            "\n" +
            "<body\n" +
            "    style=\"margin: 0;padding: 0;font-family: arial, sans-serif;font-size: 14px;line-height: 125%;width: 100%;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #222222;\">\n" +
            "    <table cellpadding=\"0\" cellspacing=\"0\" width=\"100%\" style=\"background: white; direction: rtl;\">\n" +
            "        <tbody>\n" +
            "            <tr>\n" +
            "                <td style=\"padding: 0 35px;\">\n" +
// Arabic content
//            "                    <p> انا اسمى عبدالله\n" +
// Arabic content with Unicode escape sequences
            "                    <p>\u0627\u0646\u0627 \u0627\u0633\u0645\u0649 \u0639\u0628\u062F\u0627\u0644\u0644\u0647" +
            "                    </p>\n" +
            "                </td>\n" +
            "            </tr>\n" +
            "        </tbody>\n" +
            "    </table>\n" +
            "\n" +
            "</body>\n" +
            "\n" +
            "</html>";
}
