iText7 reading out lines in a wrong order
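I am trying to read out a table from a PDF document, but I am running into a problem.
If I open the PDF in a regular viewer, it is displayed as: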
item[tab]item[tab]item[tab]item[tab]item
item[tab]item[tab]item[tab]item[tab]item
item[tab]item[tab]item[tab]item[tab]item
I convert the PDF using the following code:
StringBuilder result = new StringBuilder();
PdfDocument pdfDoc = new PdfDocument(new PdfReader(SRC));
LocationTextExtractionStrategy strategy = new LocationTextExtractionStrategy();
PdfCanvasProcessor parser = new PdfCanvasProcessor(strategy);
for (int i = 1; i <= pdfDoc.GetNumberOfPages(); i++)
{
    result.AppendLine("INFO_START_PAGE");
    string output = PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(i));
    /*Note, in the GetTextFromPage i replaced the method to output [tab] instead of a regular space on
    big spaces*/
    foreach(string data in output.Replace("\r\n", "\n").Replace("\n", "×").Split('×'))
    {
        result.AppendLine(data.Trim().Replace(" ", "[tab]"));
    }
    result.AppendLine("INFO_END_PAGE");
}
pdfDoc.Close();
return result.ToString();
In some cases, however, when I read it out with this PDF-to-text conversion, it comes out as:
item[tab]item[tab]item[tab]item[tab]item
item[tab]item[tab]item[tab]
item[tab]item
item[tab]item[tab]item[tab]item[tab]item
Is there a way to fix this?
The table in your example file is extracted as
Artikelnr. Omschrijving Aantal
Per stuk Kosten
VERHUUR L. GELEVERDE ARBEID PDC 8 € 43,70 € 349,60
VERHUUR O. GELEVERDE ARBEID PDC 3 € 60,95 € 182,85
VERHUUR L.L. GELEVERDE ARBEID EM 24
€ 32,20 € 772,80
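First of all, why does this happen
As presumed in the comments to the question, there indeed is a small vertical step: in every row the first three columns are set at the same vertical position, while the last two columns are at a slightly different vertical position: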
Row           First columns y   Last columns y
Heading row   536               535.893
First row     516               516.229
Second row    495               495.478
Third row     475               474.788
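Note in particular that the rows broken by text extraction are exactly those whose y positions differ in the digits before the decimal point (536 vs. 535, 475 vs. 474), while the rows whose digits before the decimal point match (516 vs. 516, 495 vs. 495) are not broken.
The reason is that the class TextChunkLocationDefaultImp (which is used by default to store the locations of text chunks and to compare them) stores the y position of a chunk (actually an abstraction of it that also works for text not written horizontally) in an integer variable (private readonly int distPerpendicular), and the test method SameLine requires the distPerpendicular values to be equal.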
namespace iText.Kernel.Pdf.Canvas.Parser.Listener {
    internal class TextChunkLocationDefaultImp : ITextChunkLocation {
        ...
        /// <summary>Perpendicular distance to the orientation unit vector (i.e. the Y position in an unrotated coordinate system).
        ///     </summary>
        /// <remarks>
        /// Perpendicular distance to the orientation unit vector (i.e. the Y position in an unrotated coordinate system).
        /// We round to the nearest integer to handle the fuzziness of comparing floats.
        /// </remarks>
        private readonly int distPerpendicular;
        ...
        /// <param name="as">the location to compare to</param>
        /// <returns>true is this location is on the the same line as the other</returns>
        public virtual bool SameLine(ITextChunkLocation @as) {
            ...
            float distPerpendicularDiff = DistPerpendicular() - @as.DistPerpendicular();
            if (distPerpendicularDiff == 0) {
                return true;
            }
            ...
        }
        ...
    }
}
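(Actually, further down SameLine does allow a small deviation if one of the compared text chunks has length zero. Apparently zero-length chunks are sometimes used for diacritical marks, and such marks are sometimes applied at a different height. This does not matter for your example file, though.)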
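The effect of the integer storage can be reproduced without iText. The following is only a minimal sketch: the class and method names are hypothetical, and it assumes that the value is converted with a plain (int) cast, as in the copy of the class shown further down in this answer, which truncates rather than rounds. For simplicity it uses the y values from the table directly and ignores the sign the cross product gives them.

```csharp
using System;

static class StrictSameLineDemo
{
    // Assumption: mirrors the (int) cast applied to distPerpendicular,
    // i.e. truncation toward zero, not rounding to the nearest integer.
    public static int Truncate(float y) => (int)y;

    // Mirrors the strict check: the truncated values must be equal.
    public static bool StrictSameLine(float y1, float y2) =>
        Truncate(y1) - Truncate(y2) == 0;

    static void Main()
    {
        var rows = new (string Name, float FirstY, float LastY)[]
        {
            ("Heading row", 536f, 535.893f),
            ("First row",   516f, 516.229f),
            ("Second row",  495f, 495.478f),
            ("Third row",   475f, 474.788f),
        };

        foreach (var row in rows)
        {
            Console.WriteLine(
                $"{row.Name}: {Truncate(row.FirstY)} vs {Truncate(row.LastY)}" +
                $" -> same line: {StrictSameLine(row.FirstY, row.LastY)}");
        }
    }
}
```

Under these assumptions the heading row and the third row compare as 536 vs 535 and 475 vs 474 and are therefore split, while the first and second rows compare as equal. This matches the broken extraction result above.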
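How to fix it
As we saw above, the issue is caused by the behavior of TextChunkLocationDefaultImp.SameLine. Thus, we have to change that behavior. Usually, though, we do not want to change the code of the iText classes themselves.
Fortunately, LocationTextExtractionStrategy has a constructor that allows injecting an ITextChunkLocationStrategy implementation, i.e. a factory object for ITextChunkLocation instances.
Thus, for our task we have to write an alternative, less strict ITextChunkLocation implementation and an ITextChunkLocationStrategy implementation that produces instances of it.
Unfortunately, TextChunkLocationDefaultImp is internal to iText and has many private variables. So we cannot simply derive our implementation from it; instead we have to copy it as a whole and apply our changes to that copy.
Thus,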
using System;
using iText.Kernel.Geom;
using iText.Kernel.Pdf.Canvas.Parser.Data;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// Factory that makes LocationTextExtractionStrategy use the lax location class below.
class LaxTextChunkLocationStrategy : LocationTextExtractionStrategy.ITextChunkLocationStrategy
{
    public LaxTextChunkLocationStrategy()
    {
    }

    public virtual ITextChunkLocation CreateLocation(TextRenderInfo renderInfo, LineSegment baseline)
    {
        return new TextChunkLocationLaxImp(baseline.GetStartPoint(), baseline.GetEndPoint(), renderInfo.GetSingleSpaceWidth());
    }
}

// Copy of TextChunkLocationDefaultImp with a less strict SameLine.
class TextChunkLocationLaxImp : ITextChunkLocation
{
    private const float DIACRITICAL_MARKS_ALLOWED_VERTICAL_DEVIATION = 2;
    private readonly Vector startLocation;
    private readonly Vector endLocation;
    private readonly Vector orientationVector;
    private readonly int orientationMagnitude;
    private readonly int distPerpendicular;
    private readonly float distParallelStart;
    private readonly float distParallelEnd;
    private readonly float charSpaceWidth;

    public TextChunkLocationLaxImp(Vector startLocation, Vector endLocation, float charSpaceWidth)
    {
        this.startLocation = startLocation;
        this.endLocation = endLocation;
        this.charSpaceWidth = charSpaceWidth;
        Vector oVector = endLocation.Subtract(startLocation);
        if (oVector.Length() == 0)
        {
            oVector = new Vector(1, 0, 0);
        }
        orientationVector = oVector.Normalize();
        orientationMagnitude = (int)(Math.Atan2(orientationVector.Get(Vector.I2), orientationVector.Get(Vector.I1)) * 1000);
        Vector origin = new Vector(0, 0, 1);
        // The (int) cast truncates the perpendicular distance.
        distPerpendicular = (int)(startLocation.Subtract(origin)).Cross(orientationVector).Get(Vector.I3);
        distParallelStart = orientationVector.Dot(startLocation);
        distParallelEnd = orientationVector.Dot(endLocation);
    }

    public virtual int OrientationMagnitude()
    {
        return orientationMagnitude;
    }

    public virtual int DistPerpendicular()
    {
        return distPerpendicular;
    }

    public virtual float DistParallelStart()
    {
        return distParallelStart;
    }

    public virtual float DistParallelEnd()
    {
        return distParallelEnd;
    }

    public virtual Vector GetStartLocation()
    {
        return startLocation;
    }

    public virtual Vector GetEndLocation()
    {
        return endLocation;
    }

    public virtual float GetCharSpaceWidth()
    {
        return charSpaceWidth;
    }

    public virtual bool SameLine(ITextChunkLocation @as)
    {
        if (OrientationMagnitude() != @as.OrientationMagnitude())
        {
            return false;
        }
        int distPerpendicularDiff = DistPerpendicular() - @as.DistPerpendicular();
        // The change compared to the default: accept a difference of up to 1
        // instead of requiring the integer positions to be equal.
        if (Math.Abs(distPerpendicularDiff) < 2)
        {
            return true;
        }
        LineSegment mySegment = new LineSegment(startLocation, endLocation);
        LineSegment otherSegment = new LineSegment(@as.GetStartLocation(), @as.GetEndLocation());
        return Math.Abs(distPerpendicularDiff) <= DIACRITICAL_MARKS_ALLOWED_VERTICAL_DEVIATION && (mySegment.GetLength() == 0 || otherSegment.GetLength() == 0);
    }

    public virtual float DistanceFromEndOf(ITextChunkLocation other)
    {
        return DistParallelStart() - other.DistParallelEnd();
    }

    public virtual bool IsAtWordBoundary(ITextChunkLocation previous)
    {
        if (startLocation.Equals(endLocation) || previous.GetEndLocation().Equals(previous.GetStartLocation()))
        {
            return false;
        }
        float dist = DistanceFromEndOf(previous);
        if (dist < 0)
        {
            dist = previous.DistanceFromEndOf(this);
            //The situation when the chunks intersect. We don't need to add space in this case
            if (dist < 0)
            {
                return false;
            }
        }
        return dist > GetCharSpaceWidth() / 2.0f;
    }

    internal static bool ContainsMark(ITextChunkLocation baseLocation, ITextChunkLocation markLocation)
    {
        return baseLocation.GetStartLocation().Get(Vector.I1) <= markLocation.GetStartLocation().Get(Vector.I1) &&
            baseLocation.GetEndLocation().Get(Vector.I1) >= markLocation.GetEndLocation().Get(Vector.I1) &&
            Math.Abs(baseLocation.DistPerpendicular() - markLocation.DistPerpendicular()) <= DIACRITICAL_MARKS_ALLOWED_VERTICAL_DEVIATION;
    }
}
Now, to make your code use these classes, replace
string output = PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(i));
by
LocationTextExtractionStrategy laxStrategy = new LocationTextExtractionStrategy(new LaxTextChunkLocationStrategy());
string output = PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(i), laxStrategy);
and the text extraction result becomes
Artikelnr. Omschrijving Aantal Per stuk Kosten
VERHUUR L. GELEVERDE ARBEID PDC 8 € 43,70 € 349,60
VERHUUR O. GELEVERDE ARBEID PDC 3 € 60,95 € 182,85
VERHUUR L.L. GELEVERDE ARBEID EM 24 € 32,20 € 772,80
as desired.
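To see why a tolerance of less than 2 is enough for this file, the relaxed comparison can be tried on the truncated positions in isolation. This is a minimal sketch with hypothetical names; it reproduces only the integer comparison from SameLine above, not the orientation check or the zero-length special case.

```csharp
using System;

static class LaxSameLineDemo
{
    // Same comparison as in TextChunkLocationLaxImp.SameLine, applied to
    // positions that were truncated to int first.
    public static bool LaxSameLine(float y1, float y2) =>
        Math.Abs((int)y1 - (int)y2) < 2;

    static void Main()
    {
        // Columns of the same table row: now joined.
        Console.WriteLine(LaxSameLine(536f, 535.893f)); // True
        Console.WriteLine(LaxSameLine(475f, 474.788f)); // True

        // Heading row vs. first row: still separate lines.
        Console.WriteLine(LaxSameLine(536f, 516f));     // False
    }
}
```

The price of the tolerance is that two genuinely different lines whose integer positions differ by only 1 would be merged as well. In this file the rows are about 20 units apart, so that does not happen here, but for other documents the threshold may need adjusting.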
Additional questions
How to examine a PDF to find the exact positions of the rows
In a comment to your question you asked
May i ask how you exemined the pdf to know the exact locations of the rows?
I inspected the page using iText RUPS.
In the contents of the page content stream I found:
q
...
q
1 0 0 1 60 536 cm
BT
8 0 0 8 0 0 Tm
/F3 1 Tf
(Artikelnr) Tj
8 0 0 8 31.84 0 Tm
(.) Tj
ET
Q
Q
q
...
q
1 0 0 1 147 536 cm
BT
8 0 0 8 0 0 Tm
/F3 1 Tf
(Omschrijving) Tj
ET
Q
Q
q
...
q
1 0 0 1 370 536 cm
BT
8 0 0 8 0 0 Tm
/F3 1 Tf
(Aantal) Tj
ET
Q
Q
q
...
q
1 0 0 1 433.404 535.893 cm
BT
8 0 0 8 0 0 Tm
/F3 1 Tf
(Per stuk) Tj
ET
Q
Q
q
...
q
1 0 0 1 504.878 535.893 cm
BT
8 0 0 8 0 0 Tm
/F3 1 Tf
(Kosten) Tj
ET
Q
Q
Before each of the first three headings you see
1 0 0 1 XXX 536 cm
while before each of the last two headings you see
1 0 0 1 XXX 535.893 cm
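As the text matrix is always set to 8 0 0 8 XXX 0 Tm, i.e. without a translation along the y axis, the cm instructions above set up the coordinate system so that the text is drawn at y position 536 or 535.893 respectively.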
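The arithmetic behind this can be sketched as follows. The helper names are hypothetical, and the sketch assumes that the elided parts of the stream (the "..." lines) contain no further transformations and that no text rise is set. A PDF matrix written as a b c d e f maps a point (x, y) to (a*x + c*y + e, b*x + d*y + f); the origin of the text is mapped first through the Tm matrix and then through the cm matrix.

```csharp
using System;

static class TextOriginDemo
{
    // Applies a PDF matrix given as { a, b, c, d, e, f } to the point (x, y).
    public static (double X, double Y) Apply(double[] m, double x, double y) =>
        (m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5]);

    // Start of the baseline: text space origin mapped through Tm, then through cm.
    public static (double X, double Y) TextOrigin(double[] tm, double[] cm)
    {
        var inTextMatrix = Apply(tm, 0, 0);
        return Apply(cm, inTextMatrix.X, inTextMatrix.Y);
    }

    static void Main()
    {
        double[] tm = { 8, 0, 0, 8, 0, 0 };

        // "Aantal": 1 0 0 1 370 536 cm
        var aantal = TextOrigin(tm, new double[] { 1, 0, 0, 1, 370, 536 });
        // "Per stuk": 1 0 0 1 433.404 535.893 cm
        var perStuk = TextOrigin(tm, new double[] { 1, 0, 0, 1, 433.404, 535.893 });

        Console.WriteLine($"Aantal y = {aantal.Y}, Per stuk y = {perStuk.Y}");
    }
}
```

Since the Tm matrix contributes no y translation, the y coordinate of each heading is exactly the last operand of its cm instruction, which gives 536 for "Aantal" and 535.893 for "Per stuk".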