Converting HTML to PDF using iText

Question

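I am posting this question because many developers ask more or less the same question in different forms. I will answer it myself (I am a Founder/CTO at iText Group), so that it can serve as a "Wiki-answer." If the Stack Overflow "documentation" feature still existed, this would have been a good candidate for a documentation topic.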

The source file:

I am trying to convert the following HTML file to PDF:

<html>
    <head>
        <title>Colossal (movie)</title>
        <style>
            .poster { width: 120px;float: right; }
            .director { font-style: italic; }
            .description { font-family: serif; }
            .imdb { font-size: 0.8em; }
            a { color: red; }
        </style>
    </head>
    <body>
        <img src="img/colossal.jpg" class="poster" />
        <h1>Colossal (2016)</h1>
        <div class="director">Directed by Nacho Vigalondo</div>
        <div class="description">Gloria is an out-of-work party girl
            forced to leave her life in New York City, and move back home.
            When reports surface that a giant creature is destroying Seoul,
            she gradually comes to the realization that she is somehow connected
            to this phenomenon.
        </div>
        <div class="imdb">Read more about this movie on
            <a href="www.imdb.com/title/tt4680182">IMDB</a>
        </div>
    </body>
</html>

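In a browser, this HTML looks like this:

The problems I encountered:

HTMLWorker doesn't take CSS into account at all

When I used HTMLWorker, I needed to create an ImageProvider to avoid an error telling me that the image could not be found. I also needed to create a StyleSheet instance to change some of the styles: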

public static class MyImageFactory implements ImageProvider {
    public Image getImage(String src, Map<String, String> h,
            ChainedProperties cprops, DocListener doc) {
        try {
            return Image.getInstance(
                String.format("resources/html/img/%s",
                    src.substring(src.lastIndexOf("/") + 1)));
        } catch (DocumentException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }    
}

public static void main(String[] args) throws IOException, DocumentException {
    Document document = new Document();
    PdfWriter.getInstance(document, new FileOutputStream("results/htmlworker.pdf"));
    document.open();
    StyleSheet styles = new StyleSheet();   
    styles.loadStyle("imdb", "size", "-3");
    HTMLWorker htmlWorker = new HTMLWorker(document, null, styles);
    HashMap<String,Object> providers = new HashMap<String, Object>();
    providers.put(HTMLWorker.IMG_PROVIDER, new MyImageFactory());
    htmlWorker.setProviders(providers);
    htmlWorker.parse(new FileReader("resources/html/sample.html"));
    document.close();   
}

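The result looks like this:

For some reason, HTMLWorker also shows the content of the <title> tag. I don't know how to avoid this. The CSS in the header isn't parsed at all; I have to define all the styles in my code, using the StyleSheet object.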
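To make the path handling in MyImageFactory easier to follow, here is the same string expression on its own, using only the JDK. The class name ImagePathDemo and the second and third sample inputs are assumptions made for illustration; only img/colossal.jpg comes from the HTML above.

```java
class ImagePathDemo {
    // Same expression as in MyImageFactory.getImage(): keep only the file name
    // from the src attribute and look it up in one fixed local folder.
    static String toLocalPath(String src) {
        return String.format("resources/html/img/%s",
                src.substring(src.lastIndexOf("/") + 1));
    }

    public static void main(String[] args) {
        // src as written in the sample HTML
        System.out.println(toLocalPath("img/colossal.jpg"));
        // no slash at all: lastIndexOf returns -1, so the whole string is kept
        System.out.println(toLocalPath("colossal.jpg"));
        // hypothetical absolute URL: only the last path segment survives
        System.out.println(toLocalPath("http://example.com/a/b/poster.png"));
    }
}
```

Because everything before the last slash is discarded, two images with the same file name in different folders would map to the same local file.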

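When I look at my code, I see that plenty of the objects and methods I'm using are deprecated:

So I decided to upgrade to XML Worker.

Images aren't found when using XML Worker

I tried the following code: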

public static final String DEST = "results/xmlworker1.pdf";
public static final String HTML = "resources/html/sample.html";
public void createPdf(String file) throws IOException, DocumentException {
    Document document = new Document();
    PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(file));
    document.open();
    XMLWorkerHelper.getInstance().parseXHtml(writer, document,
            new FileInputStream(HTML));
    document.close();
}

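This resulted in the following PDF:

Instead of Times-Roman, the default font Helvetica is used; this is typical for iText (I should have defined a font explicitly in my HTML). Otherwise, the CSS seems to be respected, but the image is missing, and I didn't get an error message.

With HTMLWorker, an exception was thrown, and I was able to fix the problem by introducing an ImageProvider. Let's see if this works for XML Worker.

Not all CSS styles are supported in XML Worker

I adapted my code like this: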

public static final String DEST = "results/xmlworker2.pdf";
public static final String HTML = "resources/html/sample.html";
public static final String IMG_PATH = "resources/html/";
public void createPdf(String file) throws IOException, DocumentException {
    Document document = new Document();
    PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(file));
    document.open();

    CSSResolver cssResolver =
            XMLWorkerHelper.getInstance().getDefaultCssResolver(true);
    HtmlPipelineContext htmlContext = new HtmlPipelineContext(null);
    htmlContext.setTagFactory(Tags.getHtmlTagProcessorFactory());
    htmlContext.setImageProvider(new AbstractImageProvider() {
        public String getImageRootPath() {
            return IMG_PATH;
        }
    });

    PdfWriterPipeline pdf = new PdfWriterPipeline(document, writer);
    HtmlPipeline html = new HtmlPipeline(htmlContext, pdf);
    CssResolverPipeline css = new CssResolverPipeline(cssResolver, html);

    XMLWorker worker = new XMLWorker(css, true);
    XMLParser p = new XMLParser(worker);
    p.parse(new FileInputStream(HTML));

    document.close();
}

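My code is much longer, but now the image is rendered:

The image is larger than when I rendered it using HTMLWorker, which tells me that the CSS attribute width for the poster class is taken into account, but the float attribute is ignored. How do I fix this?

The remaining question:

So the question boils down to this: I have a specific HTML file that I am trying to convert to PDF. I have gone through a lot of work, fixing one issue after the other, but there is one specific problem that I can't solve: how do I make iText respect CSS that defines the position of an element, such as float: right?

Additional question:

When my HTML includes form elements (such as <input>), those form elements are ignored.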

Answer 1

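Why your code doesn't work

As explained in the introduction of the HTML to PDF tutorial, HTMLWorker was deprecated many years ago. It was never intended to convert complete HTML pages. It doesn't know that an HTML page has a <head> and a <body> section; it just parses all the content. It was meant to parse small HTML snippets, for which you could define styles using the StyleSheet class; real CSS wasn't supported.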
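A rough way to picture the difference is to compare dumping every text node of a page with dumping only what sits under <body>. The sketch below runs the JDK's XML parser on a shortened version of the page from the question. It is not how HTMLWorker is implemented, and the class and method names are made up for the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NaiveTextDump {
    // Parses well-formed XHTML with the JDK's XML parser (no iText involved).
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("Input is not well-formed XHTML", e);
        }
    }

    // "Parse everything": every text node is emitted, the <title> text included.
    static String allText(String xhtml) {
        return parse(xhtml).getDocumentElement().getTextContent().trim();
    }

    // Structure-aware: only text below <body> is emitted.
    static String bodyText(String xhtml) {
        NodeList bodies = parse(xhtml).getElementsByTagName("body");
        return bodies.getLength() == 0 ? "" : bodies.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Colossal (movie)</title></head>"
                + "<body><h1>Colossal (2016)</h1></body></html>";
        System.out.println(allText(page));  // Colossal (movie)Colossal (2016)
        System.out.println(bodyText(page)); // Colossal (2016)
    }
}
```

The first call mirrors what shows up in the HTMLWorker output from the question, where the title text is rendered as if it were page content.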

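Then came XML Worker. XML Worker was meant as a generic framework to parse XML. As a proof of concept, we decided to write some XHTML to PDF functionality, but we didn't support all HTML tags. For instance, forms weren't supported at all, and it was very hard to support CSS that is used to position content. Forms in HTML are very different from forms in PDF. There was also a mismatch between the iText architecture and the architecture of HTML + CSS. Gradually, we extended XML Worker, mostly based on requests from customers, but XML Worker became a monster with many tentacles.

Eventually, we decided to rewrite iText from scratch, with the requirements for HTML + CSS conversion in mind. This resulted in iText 7. On top of iText 7, we created several add-ons, the most important one in this context being pdfHTML.

How to solve the problem

Using the latest version of iText (iText 7.1.0 + pdfHTML 2.0.0), the code to convert the HTML from the question to PDF is reduced to this snippet: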

public static final String SRC = "src/main/resources/html/sample.html";
public static final String DEST = "target/results/sample.pdf";
public void createPdf(String src, String dest) throws IOException {
    HtmlConverter.convertToPdf(new File(src), new File(dest));
}

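The result looks like this:

As you can see, this is pretty much the result you would expect. As of iText 7.1.0 / pdfHTML 2.0.0, the default font is Times-Roman. The CSS is respected: the image now floats on the right.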
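The additional question about form elements isn't covered by the snippet above. Here is a minimal sketch, assuming pdfHTML 2.x, where ConverterProperties offers setCreateAcroForm(boolean). With that flag set, pdfHTML is expected to create interactive AcroForm fields for elements such as <input> instead of drawing them as static content. The class name FormToPdf and the HTML string are hypothetical; please check the behavior against the pdfHTML version you actually use.

```java
import com.itextpdf.html2pdf.ConverterProperties;
import com.itextpdf.html2pdf.HtmlConverter;
import java.io.ByteArrayOutputStream;

class FormToPdf {
    // Converts an HTML string to PDF bytes, asking pdfHTML to turn
    // HTML form elements into interactive PDF form fields.
    static byte[] convert(String html) {
        ConverterProperties properties = new ConverterProperties();
        properties.setCreateAcroForm(true); // assumption: pdfHTML 2.x API
        ByteArrayOutputStream target = new ByteArrayOutputStream();
        try {
            HtmlConverter.convertToPdf(html, target, properties);
        } catch (Exception e) {
            throw new IllegalStateException("HTML to PDF conversion failed", e);
        }
        return target.toByteArray();
    }

    public static void main(String[] args) {
        // Hypothetical form, not part of the page from the question
        String html = "<html><body><form>"
                + "Name: <input type=\"text\" name=\"name\" value=\"Gloria\"/>"
                + "</form></body></html>";
        System.out.println(convert(html).length + " bytes written");
    }
}
```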

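Some additional thoughts.

When I give the advice to upgrade to iText 7 / pdfHTML 2, developers often object to upgrading to a newer iText version. Allow me to answer the top 3 arguments I hear:

I need to use the free iText, and iText 7 isn't free / the pdfHTML add-on is closed source.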

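iText 7 is released under the AGPL, just like iText 5 and XML Worker. The AGPL allows use free of charge in the context of open source projects. If you are distributing a closed source / proprietary product (for example, you use iText in a SaaS context), you can't use iText for free; in that case, you have to purchase a commercial license. This was already true for iText 5; it is still true for iText 7. As for versions prior to iText 5: you shouldn't use these at all.

Regarding pdfHTML: the first versions were indeed only available as closed source software. We had heated discussions within iText Group. On the one hand, there were people who wanted to avoid abuse by companies that don't listen when their developers tell the powers that be that open source is not the same as free of charge. Developers were telling us that their boss forced them to do the wrong thing, and that they couldn't convince their boss to purchase a commercial license. On the other hand, there were people who argued that we shouldn't punish developers for the wrong behavior of their bosses. Eventually, the people in favor of open sourcing pdfHTML, that is, the developers at iText, won the argument. Please prove that they weren't wrong, and use iText correctly: respect the AGPL if you're using iText for free; make sure that your boss purchases a commercial license if you're using iText in a closed source context.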

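I need to maintain a legacy system, and I have to use an old iText version.

Seriously? Maintenance also involves applying upgrades and migrating to new versions of the software you're using. As you can see, the code needed when using iText 7 and pdfHTML is very simple, and less error-prone than the code needed before. A migration project shouldn't take too long.

I've only just started, and I didn't know about iText 7; I only found out after I finished my project.

That's why I'm posting this question and answer. Think of yourself as an eXtreme programmer. Throw away all of your code and start anew. You'll notice that it's not as much work as you imagined, and you'll sleep better knowing that you've made your project future-proof, because iText 5 is being phased out. We still offer support to paying customers, but eventually we'll stop supporting iText 5 altogether.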

Answer 2

Use iText 7 and this code:

public void generatePDF(String htmlFile) {
    try {
        // Despite its name, htmlFile holds the HTML markup itself, not a file path
        String htmlString = htmlFile;
        // Set the destination (dirPath is assumed to be a field defined elsewhere)
        FileOutputStream fileOutputStream = new FileOutputStream(new File(dirPath + "/USER-16-PF-Report.pdf"));

        PdfWriter pdfWriter = new PdfWriter(fileOutputStream);
        ConverterProperties converterProperties = new ConverterProperties();
        PdfDocument pdfDocument = new PdfDocument(pdfWriter);

        // Set the page size
        pdfDocument.setDefaultPageSize(new PageSize(PageSize.A3));

        // Convert the markup; closing the Document also closes the PDF
        Document document = HtmlConverter.convertToDocument(htmlString, pdfDocument, converterProperties);
        document.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

Answer 3

Converting a static HTML page, including any CSS styles it uses:

HtmlConverter.convertToPdf(new File("./pdf-input.html"), new File("demo-html.pdf"));

For Spring Boot users: converting a dynamic HTML page using Spring Boot and Thymeleaf:

    @RequestMapping(path = "/pdf")
    public ResponseEntity<?> getPDF(HttpServletRequest request, HttpServletResponse response) throws IOException {
    /* Do Business Logic*/

    Order order = OrderHelper.getOrder();

    /* Create HTML using Thymeleaf template Engine */

    WebContext context = new WebContext(request, response, servletContext);
    context.setVariable("orderEntry", order);
    String orderHtml = templateEngine.process("order", context);

    /* Setup Source and target I/O streams */

    ByteArrayOutputStream target = new ByteArrayOutputStream();
    ConverterProperties converterProperties = new ConverterProperties();
    converterProperties.setBaseUri("http://localhost:8080");
    /* Call convert method */
    HtmlConverter.convertToPdf(orderHtml, target, converterProperties);

    /* extract output as bytes */
    byte[] bytes = target.toByteArray();


    /* Send the response as downloadable PDF */

    return ResponseEntity.ok()
            .header(HttpHeaders.CONTENT_DISPOSITION, "attachment; filename=order.pdf")
            .contentType(MediaType.APPLICATION_PDF)
            .body(bytes);

}
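The setBaseUri call above matters because the HTML is passed as a string, so relative references in it (images, stylesheets) have no file location to fall back on. As an illustration of the general idea only, here is how relative references resolve against a base using java.net.URI from the JDK. pdfHTML has its own resource resolver, which may differ in details. The paths are hypothetical, and the sketch writes the base with a trailing slash so that java.net.URI has a base path to resolve against.

```java
import java.net.URI;

class BaseUriDemo {
    // Resolves a reference found in the HTML (src, href) against a base URI.
    static String resolve(String baseUri, String reference) {
        return URI.create(baseUri).resolve(reference).toString();
    }

    public static void main(String[] args) {
        String base = "http://localhost:8080/";
        // Relative path: appended to the base
        System.out.println(resolve(base, "img/colossal.jpg"));
        // Root-relative path: replaces the path of the base
        System.out.println(resolve(base, "/css/order.css"));
        // Absolute URL: the base is ignored
        System.out.println(resolve(base, "https://example.com/logo.png"));
    }
}
```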
