How do I get the font height/weight from TextRenderInfo?

Question

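When I parse an existing PDF with iText(Sharp), I create an object that implements IRenderListener and pass it to PdfReaderContentParser.ProcessContent(). Sure enough, my object's RenderText() is called repeatedly for all the text in the PDF.

The problem is that the TextRenderInfo tells me about the base font (in my case, Helvetica), but I can't tell the height of the font nor its weight (regular vs. bold). Is this a known deficiency of iText(Sharp), or am I missing something?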

Answer 1

the TextRenderInfo tells me about the base font (in my case, Helvetica) but I can't tell the height of the font nor its weight (regular vs. bold)

Height

Unfortunately, iTextSharp does not provide a public font size method or member in TextRenderInfo. Some people have worked around this by using the distance between GetAscentLine() and GetDescentLine().
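As a rough illustration of that workaround, the following is a minimal sketch assuming the iTextSharp 5.x parser API. The class name AscentDescentHeightListener and its Heights list are hypothetical names chosen for this example. Note that the measured distance reflects the font's ascent and descent metrics, so it will usually differ from the nominal font size.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf.parser;

// Hypothetical listener (not part of iTextSharp): records an approximate
// glyph height for each text chunk, measured in user space units.
public class AscentDescentHeightListener : IRenderListener
{
    // Text of each chunk paired with its ascent-to-descent distance
    public readonly List<KeyValuePair<string, float>> Heights =
        new List<KeyValuePair<string, float>>();

    public void RenderText(TextRenderInfo renderInfo)
    {
        // Both lines run parallel to the baseline, so the distance between
        // their start points approximates the height of the glyph box.
        Vector ascentStart = renderInfo.GetAscentLine().GetStartPoint();
        Vector descentStart = renderInfo.GetDescentLine().GetStartPoint();
        float height = ascentStart.Subtract(descentStart).Length;

        Heights.Add(new KeyValuePair<string, float>(renderInfo.GetText(), height));
    }

    // Remaining IRenderListener members are not needed for this sketch
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }
}
```

Such a listener can be passed to PdfReaderContentParser.ProcessContent() in the same way as the listener described in the question.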

If you are ready to use reflection, though, you can do better by exposing and using the private TextRenderInfo member GraphicsState gs, e.g. as in this render listener:

using System;
using System.Collections.Generic;
using System.Reflection;
using iTextSharp.text.pdf.parser;

public class LocationTextSizeExtractionStrategy : LocationTextExtractionStrategy
{
    //Holds the font size, text, and font name of each chunk
    public List<SizeAndTextAndFont> myChunks = new List<SizeAndTextAndFont>();

    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo wholeRenderInfo)
    {
        base.RenderText(wholeRenderInfo);
        //Read the private graphics state of the chunk via reflection
        GraphicsState gs = (GraphicsState) GsField.GetValue(wholeRenderInfo);
        myChunks.Add(new SizeAndTextAndFont(gs.FontSize, wholeRenderInfo.GetText(), wholeRenderInfo.GetFont().PostscriptFontName));
    }

    //Reflection handle for the private TextRenderInfo member "gs"
    FieldInfo GsField = typeof(TextRenderInfo).GetField("gs", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
}

//Helper class that stores the font size, text, and font name of a chunk
public class SizeAndTextAndFont
{
    public float Size;
    public String Text;
    public String Font;
    public SizeAndTextAndFont(float size, String text, String font)
    {
        this.Size = size;
        this.Text = text;
        this.Font = font;
    }
}

You can extract the information using such a render listener like this:

using (var pdfReader = new PdfReader(testFile))
{
    // Loop through each page of the document
    for (var page = startPage; page < endPage; page++)
    {
        Console.WriteLine("\n    Page {0}", page);

        LocationTextSizeExtractionStrategy strategy = new LocationTextSizeExtractionStrategy();
        PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

        foreach (SizeAndTextAndFont p in strategy.myChunks)
        {
            Console.WriteLine(string.Format("<{0}> in {2} at {1}", p.Text, p.Size, p.Font));
        }
    }
}

This produces output like the following:

    Page 1
<        The Philippine Stock Exchange, Inc> in Helvetica-Bold at 8
<       Daily Quotations Report> in Helvetica-Bold at 8
<       March 23 , 2015> in Helvetica-Bold at 8
<Name> in Helvetica at 7
<Symbol> in Helvetica at 7
<Bid> in Helvetica at 7
[...]

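Taking transformations into account

The font size numbers you see in the output are the values of the font size attribute in the PDF graphics state at the time the respective text was drawn.

Due to the flexibility of PDF, though, this may not be the font size you eventually see in the rendered output, because custom transformations may stretch the output considerably. Some PDF producers even always use a font size of 1 and rely on transformations to scale the output accordingly.

To get a proper font size value in such documents, you can improve the LocationTextSizeExtractionStrategy method RenderText like this: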

public override void RenderText(TextRenderInfo wholeRenderInfo)
{
    base.RenderText(wholeRenderInfo);
    GraphicsState gs = (GraphicsState) GsField.GetValue(wholeRenderInfo);
    Matrix textToUserSpaceTransformMatrix = (Matrix) TextToUserSpaceTransformMatrixField.GetValue(wholeRenderInfo);
    float transformedFontSize = new Vector(0, gs.FontSize, 0).Cross(textToUserSpaceTransformMatrix).Length;

    myChunks.Add(new SizeAndTextAndFont(transformedFontSize, wholeRenderInfo.GetText(), wholeRenderInfo.GetFont().PostscriptFontName));
}

This requires one additional reflection FieldInfo member:

FieldInfo TextToUserSpaceTransformMatrixField = typeof(TextRenderInfo).GetField("textToUserSpaceTransformMatrix", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
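To make the effect of that computation explicit, here is a short worked derivation. It assumes the row-vector convention of the PDF specification, where the text-to-user-space matrix has the entries a, b, c, d, e, f, and writes s for the font size from the graphics state; these symbols are introduced here for illustration only.

```latex
\begin{aligned}
\begin{pmatrix} 0 & s & 0 \end{pmatrix}
\begin{pmatrix} a & b & 0 \\ c & d & 0 \\ e & f & 1 \end{pmatrix}
  &= \begin{pmatrix} s\,c & s\,d & 0 \end{pmatrix}, \\
\text{transformed font size}
  &= \sqrt{(s\,c)^2 + (s\,d)^2} = s\,\sqrt{c^2 + d^2}.
\end{aligned}
```

Because the third vector component is 0, the translation part (e, f) drops out and only scaling, rotation, and skew affect the result. For example, a font size of s = 1 combined with a pure scaling matrix where a = d = 8 and b = c = 0 gives a transformed font size of 8.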

Weight

As you can see in the output above, the font name may contain not only the font family name but also a weight indicator:

<       March 23 , 2015> in Helvetica-Bold at 8

Thus, in your example,

the TextRenderInfo tells me about the base font (in my case, Helvetica)

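Helvetica without any decoration means regular weight.

Helvetica is one of the standard 14 fonts that every PDF viewer must provide out of the box: Times-Roman, Helvetica, Courier, Symbol, Times-Bold, Helvetica-Bold, Courier-Bold, ZapfDingbats, Times-Italic, Helvetica-Oblique, Courier-Oblique, Times-BoldItalic, Helvetica-BoldOblique, Courier-BoldOblique. Thus, these names are fairly dependable.

Unfortunately, font names in general can be chosen arbitrarily; a bold font may have "Bold" or "Black" or some other boldness indicator in its name, or none at all.

One may also try to use the FontDescriptor dictionary of the font, which can specify a FontWeight entry. Unfortunately, this entry is optional, so you cannot count on it being there at all.

Furthermore, fonts in a PDF can be made artificially bold, cf. this answer:

All those numbers are drawn using the same font, merely with an increasing outline line width.

Thus, I'm afraid there is no reliable way to find the exact font weight, only some heuristics which may return an acceptable approximation.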
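A name-based check along these lines could be sketched as follows. This is only a heuristic under stated assumptions: the class FontWeightHeuristic, the method LooksBold, and the marker list are hypothetical choices for this example, not part of iTextSharp, and names such as "Blackadder" would produce false positives.

```csharp
using System;

// Hypothetical helper (not part of iTextSharp): guesses boldness from a
// PostScript font name. This is a heuristic, not a reliable test.
public static class FontWeightHeuristic
{
    // Assumed list of common boldness markers; extend as needed
    private static readonly string[] BoldMarkers =
        { "Bold", "Black", "Heavy", "Semibold", "Demi" };

    public static bool LooksBold(string postscriptFontName)
    {
        if (string.IsNullOrEmpty(postscriptFontName))
            return false;

        // Embedded subset fonts carry a six-letter prefix such as "ABCDEF+"
        int plus = postscriptFontName.IndexOf('+');
        string name = plus == 6 ? postscriptFontName.Substring(7) : postscriptFontName;

        foreach (string marker in BoldMarkers)
        {
            if (name.IndexOf(marker, StringComparison.OrdinalIgnoreCase) >= 0)
                return true;
        }
        return false;
    }
}
```

It could be applied to the value of wholeRenderInfo.GetFont().PostscriptFontName collected by the render listener above.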


Tags: pdf, fonts, itext