iText7 reading out lines in a wrong order

I am trying to read a table out of a PDF document, but I am running into a problem.

If I open the PDF in a regular viewer, it is displayed as:

item[tab]item[tab]item[tab]item[tab]item
item[tab]item[tab]item[tab]item[tab]item
item[tab]item[tab]item[tab]item[tab]item

I convert the PDF using the following method:

StringBuilder result = new StringBuilder();
PdfDocument pdfDoc = new PdfDocument(new PdfReader(SRC));

LocationTextExtractionStrategy strategy = new LocationTextExtractionStrategy();

PdfCanvasProcessor parser = new PdfCanvasProcessor(strategy);
for (int i = 1; i <= pdfDoc.GetNumberOfPages(); i++)
{
    result.AppendLine("INFO_START_PAGE");
    string output = PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(i));
    /* Note: inside GetTextFromPage I replaced the method so that it outputs [tab]
    instead of a regular space for large gaps. */
    foreach(string data in output.Replace("\r\n", "\n").Replace("\n", "×").Split('×'))
    {
        result.AppendLine(data.Trim().Replace("   ", "[tab]"));
    }

    result.AppendLine("INFO_END_PAGE");
}

pdfDoc.Close();
return result.ToString();

In some cases, when I read it out using this PDF-to-text conversion, it shows up as:

item[tab]item[tab]item[tab]item[tab]item
item[tab]item[tab]item[tab]
item[tab]item
item[tab]item[tab]item[tab]item[tab]item

Is there a way to fix this?

The table in your example file is extracted as

Artikelnr. Omschrijving Aantal
Per stuk Kosten
VERHUUR L. GELEVERDE ARBEID PDC 8 € 43,70 € 349,60
VERHUUR O. GELEVERDE ARBEID PDC 3 € 60,95 € 182,85
VERHUUR L.L. GELEVERDE ARBEID EM 24
€ 32,20 € 772,80

First of all, why does this happen

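As suspected in the comments to the question, there is indeed a small vertical step: in every row the first three columns are set at the same vertical position, while the last two columns sit at a slightly different vertical position: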

Row           First columns y   Last columns y
Heading row               536          535.893
First row                 516          516.229
Second row                495          495.478
Third row                 475          474.788

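Note in particular that the rows broken up by text extraction are exactly those whose y positions differ in the digits before the decimal point (536 vs. 535, 475 vs. 474), while the rows whose digits before the decimal point match (516 and 495) are not broken.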

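The cause is that the class TextChunkLocationDefaultImp (which is used by default to store the locations of text chunks and to compare those locations) stores the y position of a chunk (actually an abstraction of it that also works for text that is not written horizontally) in an integer variable (private readonly int distPerpendicular), and the test method SameLine requires the distPerpendicular values to be equal.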

namespace iText.Kernel.Pdf.Canvas.Parser.Listener {
    internal class TextChunkLocationDefaultImp : ITextChunkLocation {
        ...
        /// <summary>Perpendicular distance to the orientation unit vector (i.e. the Y position in an unrotated coordinate system).
        ///     </summary>
        /// <remarks>
        /// Perpendicular distance to the orientation unit vector (i.e. the Y position in an unrotated coordinate system).
        /// We round to the nearest integer to handle the fuzziness of comparing floats.
        /// </remarks>
        private readonly int distPerpendicular;
        ...
        /// <param name="as">the location to compare to</param>
        /// <returns>true is this location is on the the same line as the other</returns>
        public virtual bool SameLine(ITextChunkLocation @as) {
            ...
            float distPerpendicularDiff = DistPerpendicular() - @as.DistPerpendicular();
            if (distPerpendicularDiff == 0) {
                return true;
            }
            ...
        }
        ...
    }
}

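(Actually, further down, SameLine does allow a small deviation if one of the compared text chunks has zero length. Apparently zero-length chunks are sometimes used for diacritical marks, and such marks are sometimes applied at a different height. This does not matter for your example file, though.)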
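
To make the effect concrete, here is a minimal sketch that does not use iText at all. It only mimics the two steps described above, storing the y position in an int by a cast and then requiring equality, and applies them to the y positions from the table. The names StrictSameLineDemo, ToStoredY and SameLineStrict are made up for this illustration. The real class derives its value from a cross product, so the sign and exact value may differ; only the truncation matters here.

```csharp
using System;

public static class StrictSameLineDemo
{
    // Mimics storing a y position in an int field via a cast (truncation toward zero).
    public static int ToStoredY(double y) => (int)y;

    // Mimics the strict test: two chunks share a line only if the stored ints are equal.
    public static bool SameLineStrict(double y1, double y2) => ToStoredY(y1) == ToStoredY(y2);

    public static void Main()
    {
        // y positions taken from the table above: (first columns, last columns)
        var rows = new (string Name, double First, double Last)[]
        {
            ("Heading row", 536, 535.893),
            ("First row", 516, 516.229),
            ("Second row", 495, 495.478),
            ("Third row", 475, 474.788),
        };

        foreach (var row in rows)
        {
            Console.WriteLine($"{row.Name}: same line = {SameLineStrict(row.First, row.Last)}");
        }
    }
}
```

Under these assumptions the heading row and the third row are reported as not being on the same line, which matches the rows that were split in the extracted text.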

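How to fix it

As we saw above, the issue is caused by the behavior of TextChunkLocationDefaultImp.SameLine. Thus, we have to change that behavior. Usually, though, we do not want to change the code of the iText classes themselves.

Fortunately, LocationTextExtractionStrategy has a constructor that allows injecting an ITextChunkLocationStrategy implementation, i.e. a factory object for ITextChunkLocation instances.

So for our task we have to write an alternative, less strict ITextChunkLocation implementation, plus an ITextChunkLocationStrategy implementation that produces instances of it.

Unfortunately, TextChunkLocationDefaultImp is internal to iText and has many private variables. Thus, we cannot simply derive our implementation from it; instead we have to copy and paste it as a whole and apply our changes to that copy.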

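Thus:

using System;
using iText.Kernel.Geom;
using iText.Kernel.Pdf.Canvas.Parser.Data;
using iText.Kernel.Pdf.Canvas.Parser.Listener;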

class LaxTextChunkLocationStrategy : LocationTextExtractionStrategy.ITextChunkLocationStrategy
{
    public LaxTextChunkLocationStrategy()
    {
    }

    public virtual ITextChunkLocation CreateLocation(TextRenderInfo renderInfo, LineSegment baseline)
    {
        return new TextChunkLocationLaxImp(baseline.GetStartPoint(), baseline.GetEndPoint(), renderInfo.GetSingleSpaceWidth());
    }
}

class TextChunkLocationLaxImp : ITextChunkLocation
{
    private const float DIACRITICAL_MARKS_ALLOWED_VERTICAL_DEVIATION = 2;
    private readonly Vector startLocation;
    private readonly Vector endLocation;
    private readonly Vector orientationVector;
    private readonly int orientationMagnitude;
    private readonly int distPerpendicular;
    private readonly float distParallelStart;
    private readonly float distParallelEnd;
    private readonly float charSpaceWidth;

    public TextChunkLocationLaxImp(Vector startLocation, Vector endLocation, float charSpaceWidth)
    {
        this.startLocation = startLocation;
        this.endLocation = endLocation;
        this.charSpaceWidth = charSpaceWidth;
        Vector oVector = endLocation.Subtract(startLocation);
        if (oVector.Length() == 0)
        {
            oVector = new Vector(1, 0, 0);
        }
        orientationVector = oVector.Normalize();
        orientationMagnitude = (int)(Math.Atan2(orientationVector.Get(Vector.I2), orientationVector.Get(Vector.I1)) * 1000);
        Vector origin = new Vector(0, 0, 1);
        distPerpendicular = (int)(startLocation.Subtract(origin)).Cross(orientationVector).Get(Vector.I3);
        distParallelStart = orientationVector.Dot(startLocation);
        distParallelEnd = orientationVector.Dot(endLocation);
    }

    public virtual int OrientationMagnitude()
    {
        return orientationMagnitude;
    }

    public virtual int DistPerpendicular()
    {
        return distPerpendicular;
    }

    public virtual float DistParallelStart()
    {
        return distParallelStart;
    }

    public virtual float DistParallelEnd()
    {
        return distParallelEnd;
    }

    public virtual Vector GetStartLocation()
    {
        return startLocation;
    }

    public virtual Vector GetEndLocation()
    {
        return endLocation;
    }

    public virtual float GetCharSpaceWidth()
    {
        return charSpaceWidth;
    }

    public virtual bool SameLine(ITextChunkLocation @as)
    {
        if (OrientationMagnitude() != @as.OrientationMagnitude())
        {
            return false;
        }
        int distPerpendicularDiff = DistPerpendicular() - @as.DistPerpendicular();
        // Changed from the iText original, which requires the difference to be exactly 0.
        if (Math.Abs(distPerpendicularDiff) < 2)
        {
            return true;
        }
        LineSegment mySegment = new LineSegment(startLocation, endLocation);
        LineSegment otherSegment = new LineSegment(@as.GetStartLocation(), @as.GetEndLocation());
        return Math.Abs(distPerpendicularDiff) <= DIACRITICAL_MARKS_ALLOWED_VERTICAL_DEVIATION && (mySegment.GetLength() == 0 || otherSegment.GetLength() == 0);
    }

    public virtual float DistanceFromEndOf(ITextChunkLocation other)
    {
        return DistParallelStart() - other.DistParallelEnd();
    }

    public virtual bool IsAtWordBoundary(ITextChunkLocation previous)
    {
        if (startLocation.Equals(endLocation) || previous.GetEndLocation().Equals(previous.GetStartLocation()))
        {
            return false;
        }
        float dist = DistanceFromEndOf(previous);
        if (dist < 0)
        {
            dist = previous.DistanceFromEndOf(this);
            //The situation when the chunks intersect. We don't need to add space in this case
            if (dist < 0)
            {
                return false;
            }
        }
        return dist > GetCharSpaceWidth() / 2.0f;
    }

    internal static bool ContainsMark(ITextChunkLocation baseLocation, ITextChunkLocation markLocation)
    {
        return baseLocation.GetStartLocation().Get(Vector.I1) <= markLocation.GetStartLocation().Get(Vector.I1) &&
             baseLocation.GetEndLocation().Get(Vector.I1) >= markLocation.GetEndLocation().Get(Vector.I1) && Math.
            Abs(baseLocation.DistPerpendicular() - markLocation.DistPerpendicular()) <= DIACRITICAL_MARKS_ALLOWED_VERTICAL_DEVIATION;
    }
}
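
The only behavioral change in this copy is the relaxed comparison in SameLine. As a rough check that a tolerance of less than 2 is enough for the example table without merging neighboring rows, the following sketch applies the same comparison to truncated y positions. It is independent of iText, and the names LaxSameLineDemo and SameLineLax are hypothetical.

```csharp
using System;

public static class LaxSameLineDemo
{
    // Mimics the relaxed test: the truncated y positions may differ by less than 2.
    public static bool SameLineLax(double y1, double y2)
    {
        int diff = (int)y1 - (int)y2;
        return Math.Abs(diff) < 2;
    }

    public static void Main()
    {
        // Columns of the same table row (y positions from the example file).
        Console.WriteLine($"536 vs 535.893: {SameLineLax(536, 535.893)}");
        Console.WriteLine($"475 vs 474.788: {SameLineLax(475, 474.788)}");

        // Columns of two neighboring table rows, which must stay separate.
        Console.WriteLine($"536 vs 516.229: {SameLineLax(536, 516.229)}");
    }
}
```

With rows about 20 units apart, as in this file, the tolerance joins the columns of one row but keeps different rows apart. For documents with much tighter line spacing the threshold might need adjusting.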

Now, to make your code use these classes, replace

string output = PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(i));

with

LocationTextExtractionStrategy laxStrategy = new LocationTextExtractionStrategy(new LaxTextChunkLocationStrategy());
string output = PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(i), laxStrategy);

The text extraction result then becomes

Artikelnr. Omschrijving Aantal Per stuk Kosten
VERHUUR L. GELEVERDE ARBEID PDC 8 € 43,70 € 349,60
VERHUUR O. GELEVERDE ARBEID PDC 3 € 60,95 € 182,85
VERHUUR L.L. GELEVERDE ARBEID EM 24 € 32,20 € 772,80

as desired.

Additional questions

How to inspect the PDF to learn the exact positions of the rows

In a comment on your question you asked

May i ask how you exemined the pdf to know the exact locations of the rows?

I inspected the page using iText RUPS.

In the contents of the stream I selected there, I found:

q
...
q
1 0 0 1 60 536 cm
BT
8 0 0 8 0 0 Tm
/F3 1 Tf
(Artikelnr) Tj
8 0 0 8 31.84 0 Tm
(.) Tj
ET
Q
Q
q
...
q
1 0 0 1 147 536 cm
BT
8 0 0 8 0 0 Tm
/F3 1 Tf
(Omschrijving) Tj
ET
Q
Q
q
...
q
1 0 0 1 370 536 cm
BT
8 0 0 8 0 0 Tm
/F3 1 Tf
(Aantal) Tj
ET
Q
Q
q
...
q
1 0 0 1 433.404 535.893 cm
BT
8 0 0 8 0 0 Tm
/F3 1 Tf
(Per stuk) Tj
ET
Q
Q
q
...
q
1 0 0 1 504.878 535.893 cm
BT
8 0 0 8 0 0 Tm
/F3 1 Tf
(Kosten) Tj
ET
Q
Q 

Before each of the first three headings you see

1 0 0 1 XXX 536 cm

while before each of the last two headings you see

1 0 0 1 XXX 535.893 cm

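As the text matrix is always set to 8 0 0 8 XXX 0 Tm, i.e. without any translation along the y axis, the cm instructions above set up the coordinate system so that the text is drawn at y position 536 or 535.893, respectively.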
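
The arithmetic behind this can be sketched as a plain matrix product: the text origin in page coordinates is given by the text matrix multiplied by the current transformation matrix. The following is a minimal illustration, not iText code. The Matrix struct and TextOriginDemo class are made up for this example, and it assumes that nothing in the elided parts of the stream (the "..." lines) applies a further transformation.

```csharp
using System;

public static class TextOriginDemo
{
    // A PDF transformation matrix [a b c d e f], as used by the cm and Tm operators.
    public readonly struct Matrix
    {
        public readonly double A, B, C, D, E, F;

        public Matrix(double a, double b, double c, double d, double e, double f)
        {
            A = a; B = b; C = c; D = d; E = e; F = f;
        }

        // Returns this x other, i.e. apply this matrix first and then other.
        public Matrix Multiply(Matrix o) => new Matrix(
            A * o.A + B * o.C,
            A * o.B + B * o.D,
            C * o.A + D * o.C,
            C * o.B + D * o.D,
            E * o.A + F * o.C + o.E,
            E * o.B + F * o.D + o.F);
    }

    public static void Main()
    {
        var ctm = new Matrix(1, 0, 0, 1, 433.404, 535.893); // "1 0 0 1 433.404 535.893 cm"
        var tm = new Matrix(8, 0, 0, 8, 0, 0);              // "8 0 0 8 0 0 Tm"

        // E and F of the product are the x and y of the text origin in page coordinates.
        Matrix textToPage = tm.Multiply(ctm);
        Console.WriteLine($"Per stuk starts at x = {textToPage.E}, y = {textToPage.F}");
    }
}
```

Because the text matrix has no translation along the y axis and the cm matrix has no scaling or rotation, the resulting y value is just the last operand of the cm instruction.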