How to extract PDF data into Excel?

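I want to convert PDF data into Excel data. I have converted the PDF to a text file and removed the unnecessary text from the .txt file, but the values are now laid out in rows and I want them arranged in columns.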

PDF file: chemistry-chemists.com/chemister/Spravochniki/handbook-of-aqueous-solubility-data-2010.pdf

Current state of the Excel file:

Desired state of the Excel file:

PDFtables.com is good at extracting tables from PDFs into Excel. It should do what you need :)

In ASP.NET you can use the following code. First, the markup for the upload form:

    <div>
        Upload PDF File: <asp:FileUpload ID="fuPdfUpload" runat="server" />
        <asp:Button ID="btnExportToExcel" Text="Export To Excel" OnClick="ExportToExcel" runat="server" />
    </div>

Note: you must install the iTextSharp package from NuGet for the code-behind below.

    // Requires: using System; using System.Text;
    //           using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;

    protected void ExportToExcel(object sender, EventArgs e)
    {
        if (this.fuPdfUpload.HasFile)
        {
            // Read the uploaded bytes directly: the client-side file name
            // is not a valid path on the server.
            this.ExportPDFToExcel(this.fuPdfUpload.FileBytes);
        }
    }

    private void ExportPDFToExcel(byte[] pdfBytes)
    {
        StringBuilder text = new StringBuilder();
        PdfReader pdfReader = new PdfReader(pdfBytes);

        // Extract the text of every page (page numbers are 1-based).
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
            string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            text.AppendLine(currentText);
        }

        pdfReader.Close();

        // Send the extracted text to the browser as a download for Excel.
        Response.Clear();
        Response.Buffer = true;
        Response.AddHeader("content-disposition", "attachment;filename=ReceiptExport.xls");
        Response.Charset = "";
        Response.ContentType = "application/vnd.ms-excel";
        Response.Write(text.ToString());
        Response.Flush();
        Response.End();
    }
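The handler above writes the extracted text as is, so each line of text will most likely end up in a single cell. To get columns, the lines have to be regrouped first. Here is a minimal sketch of one way to do that, assuming each record in the cleaned text spans a fixed number of consecutive lines. That assumption, the `RowsToColumns` class, the `ToTabSeparatedRows` method, and the sample values are all hypothetical and not taken from the handbook, so adjust them to the real layout.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RowsToColumns
{
    // Groups every `fieldsPerRecord` consecutive non-empty lines into one
    // tab-separated row. `fieldsPerRecord` is an assumption about the data.
    public static List<string> ToTabSeparatedRows(IEnumerable<string> lines, int fieldsPerRecord)
    {
        if (fieldsPerRecord <= 0)
            throw new ArgumentOutOfRangeException("fieldsPerRecord");

        // Trim whitespace and drop blank lines before grouping.
        List<string> values = lines
            .Select(line => line.Trim())
            .Where(line => line.Length > 0)
            .ToList();

        var rows = new List<string>();
        for (int i = 0; i < values.Count; i += fieldsPerRecord)
        {
            // The last record may be shorter if the input is incomplete.
            IEnumerable<string> record = values.Skip(i).Take(fieldsPerRecord);
            rows.Add(string.Join("\t", record));
        }
        return rows;
    }

    public static void Main()
    {
        // Placeholder values only: three fields per record.
        string[] lines = { "name1", "temp1", "value1", "", "name2", "temp2", "value2" };
        foreach (string row in ToTabSeparatedRows(lines, 3))
            Console.WriteLine(row);
    }
}
```

In the handler you could split the extracted text on line breaks, pass the lines through this helper, and write the rows joined with `"\r\n"` instead of the raw text. Excel normally splits tab-separated text into columns, although it may warn that the file content does not match the .xls extension.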

Take a look at Tabula, a very effective tool for extracting tables from PDFs: https://github.com/tabulapdf/tabula