Using Azure Search for PDFs in Azure Blob Storage

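We are trying to enable full-text search. The application stores PDF files in Azure Blob Storage, which is the data source for Azure Search. Most of it works fine, but the indexer fails to extract text from a few PDFs. Are there any specific kinds of PDFs that the Azure Search indexer can extract? If so, what are they?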

Any information, help, or support on this would be greatly appreciated.

Are there any specific kinds of PDFs that Azure Search Indexer can extract?

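In my experience, there is no specific kind of PDF that the Azure Search indexer cannot extract. Based on your description, I think you are hitting an Azure Search limit. For more details, see Indexing Documents in Azure Blob Storage with Azure Search: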

Azure Search limits how much text it extracts depending on the pricing tier: 32,000 characters for Free tier, 64,000 for Basic, and 4 million for Standard, Standard S2 and Standard S3 tiers. A warning is included in the indexer status response for truncated documents.
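To check whether truncation is what is happening, you can compare a document's text length against the limits quoted above and look for warnings in the indexer status response. Below is a minimal sketch, not a definitive implementation. It assumes the status endpoint has the form `/indexers/{name}/status` with an `api-key` header, and that warnings appear under `lastResult.warnings` with `key` and `message` fields; verify both against the REST API reference for the API version you use. The tier labels, function names, and sample payload are made up for illustration.

```python
import json
import urllib.request

# Per-document extraction limits quoted above, in characters.
# The tier labels used as keys are illustrative, not official SKU names.
EXTRACTION_LIMITS = {
    "free": 32_000,
    "basic": 64_000,
    "standard": 4_000_000,
    "standard_s2": 4_000_000,
    "standard_s3": 4_000_000,
}


def may_be_truncated(text_length, tier):
    """Return True if a document with this many characters exceeds the tier's limit."""
    return text_length > EXTRACTION_LIMITS[tier]


def fetch_indexer_status(service, indexer, api_key, api_version):
    """Fetch the indexer status as a dict (needs network access and an admin key).

    The URL shape and the 'api-key' header are assumptions based on the
    Azure Search REST API; the caller supplies the api-version string.
    """
    url = (
        f"https://{service}.search.windows.net/indexers/{indexer}/status"
        f"?api-version={api_version}"
    )
    request = urllib.request.Request(url, headers={"api-key": api_key})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def collect_warnings(status):
    """Return (document key, message) pairs from the last indexer run.

    Assumes warnings live under status['lastResult']['warnings'].
    """
    last_result = status.get("lastResult") or {}
    warnings = last_result.get("warnings") or []
    return [(w.get("key"), w.get("message")) for w in warnings]


# Made-up sample payload so the parsing can be tried without a live service.
sample_status = {
    "lastResult": {
        "status": "success",
        "warnings": [
            {"key": "report-2016.pdf", "message": "sample truncation warning"}
        ],
    }
}

for key, message in collect_warnings(sample_status):
    print(f"{key}: {message}")
```

If the PDFs that fail all show up with a warning in the last run, the limit is the likely cause; if they show no warning at all, the problem is more likely the PDF content itself.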

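Azure Search can extract all text from PDF text elements. Extracting text from embedded images (which requires OCR) or from tables is not yet integrated into Azure Search, but it is on the roadmap.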

If your PDFs contain images and you also want to extract text from them, you can try following the steps here.

I recently wrote a blog post about my experience with this. I ended up using a Python-based script, running in a Docker container in Azure, to OCR the PDFs and make them searchable:

http://martyice.github.io/docker-in-azure/

azure-cognitive-search

azure-blob-storage