Parsing a table with iTextSharp

Question

I am using iTextSharp v5.5.13.

I have a large number of PDF files to parse. About 5% of them contain a table with data that I also need.

The table looks like this:

Most of the time, the row I need is parsed as
2 januari 15 januari € 49,49 € 21,57 € 15,09 € 34,39

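I can handle that: I split on spaces and it works.
But sometimes the month name contains an extra space: janu ari

I know I can override the strategy to get rid of these extra spaces. I already do that for the rest of the PDF (ITextExtractionStrategy), but for this table I am using a rectangle strategy: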

var rect = new System.util.RectangleJ(70, 425, 460, 200);
RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
ITextExtractionStrategy strategy =
    new FilteredTextRenderListener(new MyLocationTextExtractionStrategy(), filter);
var lines = PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy).Split('\n');

My override looks like this:

public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy
{
    protected override bool IsChunkAtWordBoundary(TextChunk chunk, TextChunk previousChunk)
    {
        var dist = chunk.DistanceFromEndOf(previousChunk);
        return dist < -chunk.CharSpaceWidth || dist > chunk.CharSpaceWidth / 2.0f;
    }
}

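I found this through Google, but it did not solve my problem.
In the janu ari case, dist is greater than -chunk.CharSpaceWidth, and I am not sure what to do next.

Please tell me if I should not be using a rectangle strategy for this table and should take a different approach instead.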

Answer 1

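If the data in this type of table always has the same format, you can take a different approach: accept whatever MyLocationTextExtractionStrategy throws at you, then massage that data into a format you can work with.

In this case, your data is always:

2 groups of:
- 1 or 2 digits (the day of the month)
- some characters (the month name)
4 groups of:
- a euro sign
- some digits (at least one)
- a comma
- 2 digits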

In 2 januari 15 januari € 49,49 € 21,57 € 15,09 € 34,39 the spaces act as separators, but with data this well structured you do not even need them. So simply remove them, and your data becomes 2januari15januari€49,49€21,57€15,09€34,39.
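Removing the spaces could look like the minimal sketch below. The class and method names (RowNormalizer, StripWhitespace) are made up for illustration and are not part of iTextSharp; the input is the row from the question with the stray space in "janu ari".

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of iTextSharp.
public static class RowNormalizer
{
    // Removes every whitespace character, including a stray one inside "janu ari".
    public static string StripWhitespace(string line)
    {
        return Regex.Replace(line, @"\s+", "");
    }

    public static void Main()
    {
        var clean = StripWhitespace("2 janu ari 15 januari € 49,49 € 21,57 € 15,09 € 34,39");
        Console.WriteLine(clean);
    }
}
```

After this step a row with "janu ari" and a row with "januari" produce the same string, so the extra space no longer matters.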

Now you can use a regular expression with capture groups to massage your data into a nice table.

2 groups of:
- [0-9]{1,2}
- [a-z]*
4 groups of:
- €
- [0-9]{1,}
- ,
- [0-9]{2}

As you wrote in the comments, one possible resulting regular expression could be:

new Regex(@"([0-9]{1,2})([a-z]*)([0-9]{1,2})([a-z]*)(€[0-9]{1,},[0-9]{2})(€[0-9]{1,},[0-9]{2})(€[0-9]{1,},[0-9]{2})(€[0-9]{1,},[0-9]{2})")
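As a rough sketch of how that expression might be used on one extracted line: the code below strips the whitespace, applies the pattern unchanged, and reads the eight capture groups. TableRowParser and TryParseRow are hypothetical names, and converting the amounts to decimal by swapping the comma for a dot is an assumption about what you want to do with them.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of iTextSharp.
public static class TableRowParser
{
    // The pattern from the answer, unchanged: 2 day/month pairs, then 4 euro amounts.
    private static readonly Regex RowPattern = new Regex(
        @"([0-9]{1,2})([a-z]*)([0-9]{1,2})([a-z]*)(€[0-9]{1,},[0-9]{2})(€[0-9]{1,},[0-9]{2})(€[0-9]{1,},[0-9]{2})(€[0-9]{1,},[0-9]{2})");

    public static bool TryParseRow(string line, out string[] dates, out decimal[] amounts)
    {
        dates = new string[2];
        amounts = new decimal[4];

        // Drop all whitespace first so "janu ari" and "januari" look the same.
        var match = RowPattern.Match(Regex.Replace(line, @"\s+", ""));
        if (!match.Success) return false;

        // Groups 1-4: day, month, day, month.
        dates[0] = match.Groups[1].Value + " " + match.Groups[2].Value;
        dates[1] = match.Groups[3].Value + " " + match.Groups[4].Value;

        // Groups 5-8: amounts such as "€49,49"; strip the sign and use '.' as decimal separator.
        for (var i = 0; i < 4; i++)
        {
            var raw = match.Groups[5 + i].Value.TrimStart('€').Replace(',', '.');
            amounts[i] = decimal.Parse(raw, CultureInfo.InvariantCulture);
        }
        return true;
    }

    public static void Main()
    {
        string[] dates;
        decimal[] amounts;
        if (TryParseRow("2 janu ari 15 januari € 49,49 € 21,57 € 15,09 € 34,39", out dates, out amounts))
        {
            Console.WriteLine(dates[0] + " | " + dates[1]);
            Console.WriteLine(string.Join(" | ", amounts));
        }
    }
}
```

The pattern only covers lowercase month names and amounts without a thousands separator, as in the sample row; lines that do not match return false, so other text from the rectangle can simply be skipped.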
