How to find figure captions in a PDF?

Question

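I want to develop a Python script that finds all the figure captions in a PDF. I would like to know whether it is possible to collect every figure caption and append it to an array as the script comes across each new one.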

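I tried searching for the word "Figure" and then grabbing the whole sentence it appears in, but this is not effective: it cannot find all the sentences in a caption, only the single sentence delimited by periods.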

Edit: Below is the sample PDF I intend to use. As you can see, the text Fig.1 is written directly below the image.

New edit: Here is the complete HTML file converted with pdf2htmlEX: https://drive.google.com/open?id=1hYriVrTlwmxR35A2Jy7mKoO4ns2oWe3Z

Answer 1

This answer is incomplete; we will update it as we work through the problem.

Copy of the original PDF:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC335638/pdf/pnas00677-0355.pdf

Step 1 - Try PyPDF2

# Import the required module
import PyPDF2

# Open the PDF in binary mode; the with block closes the file afterwards
with open('example.pdf', 'rb') as pdf_file:
    # Create a PDF reader object (PdfReader replaces the older PdfFileReader)
    reader = PyPDF2.PdfReader(pdf_file)

    # Print the number of pages in the PDF
    print(len(reader.pages))

    # Get the first page
    page = reader.pages[0]

    # Extract and print the text of the page
    print(page.extract_text())

This is not suitable, because the words are not even separated by spaces.

Step 2 - Try pdf2htmlEX

It was suggested that we try https://github.com/coolwanglu/pdf2htmlEX to convert the PDF to HTML and then develop suitable selectors for use with beautifulsoup4.

pdf2htmlEX generates HTML in which every word is wrapped in its own tag, which does not help us at all.

Step 3 - Try pdfminer.six

https://github.com/pdfminer/pdfminer.six
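The answer does not show how the text was extracted. A minimal sketch, assuming the `extract_text` helper from `pdfminer.high_level` in pdfminer.six; the function name `pdf_to_text`, its `extract` hook, and the file names are hypothetical:

```python
def pdf_to_text(pdf_path, txt_path, extract=None):
    """Write the text layer of a PDF to a plain-text file and return the text.

    `extract` is a hypothetical hook for swapping in another extractor;
    by default pdfminer.six's extract_text is used.
    """
    if extract is None:
        # Imported lazily so the function can be defined without pdfminer.six.
        from pdfminer.high_level import extract_text as extract
    text = extract(pdf_path)
    with open(txt_path, "w", encoding="utf-8") as out_file:
        out_file.write(text)
    return text


# Example call with assumed file names:
# pdf_to_text("example.pdf", "sample.txt")
```

Writing the result to a file keeps the extraction separate from the caption search, so the slow PDF parsing only has to run once.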

It is still not perfect, but much better:

CIRCUITS IN THE CEREBELLAR CONTROL OF MOVEMENT

BY JOHN C. ECCLES

AMA/ERF INSTITUTE FOR BIOMEDICAL RESEARCH, CHICAGO

Communicated May 16, 1967

Neuroanatomists have generally recognized that the cerebellum provides the greatest challenge in our initial efforts to discern functional meaning in neuronal patterns because there is a stereotyped and simple geometrical arrangement of its Presumably, it is for this reason that there is the unique neuronal constituents. most refined knowledge of microstructure that is available in the central nervous system. The pioneer investigations of Ram6n y Cajall have led in recent times to fascinating developments concerning microstructure, geometrical arrangements, and numerical assessment.2

As shown in Figure 1,3 there are only two kinds of afferent fibers conveying information to the cerebellum, the climbing fibers (cf) and the mossy fibers (mf); and there is only one type of efferent fiber from the cerebellum, the axons of the Purkinje cells (Pc), which terminate in the cerebellar nuclei (cn) and otherwise largely in Deiters' nucleus. The climbing fiber is uniquely distributed to a single

FIG. 1.-Perspective drawing by Fox3 of a part of a folium of the cerebellar cortex. The principal

components are shown in diagrammatic form, and are described in the text.

336

VOL. 58, 1967

PHYSIOLOGY: J. C. ECCLES

337

We can then run this code on the output:

import re

# Read in the text extracted from the PDF
file_name = "sample.txt"
with open(file_name, "r") as pdf_text_file:
    pdf_text = pdf_text_file.read()

# Split the text into blocks separated by a blank line
blocks = pdf_text.split("\n\n")

# Replace the arbitrary line breaks inside each block with spaces,
# so that words at the end of one line do not fuse with the next line
blocks = [block.replace("\n", " ").strip() for block in blocks]

# Which blocks are figure captions?
captions = []
for block in blocks:
    # A caption block starts with "fig" in any letter case
    if re.search('^fig', block, re.IGNORECASE):
        captions.append(block)

# Done!
for caption in captions:
    print(caption)
    print()

This may need more tweaking, because the output of pdfminer.six is not perfect.
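One such tweak: in the pdfminer.six output above, the caption is split across two blocks ("... The principal" and "components are shown ..."), so the script would only keep the first half. The following is a rough sketch of a heuristic that keeps appending blocks until the caption ends a sentence. The function name `find_captions` and the sentence-end rule are assumptions made for illustration, not part of the original answer:

```python
import re

# Matches blocks starting with "Fig", "FIG." or "Figure" followed by a number.
CAPTION_START = re.compile(r"^\s*fig(?:ure)?\.?\s*\d+", re.IGNORECASE)
# A block that ends with one of these is treated as a finished caption.
SENTENCE_END = (".", "!", "?")


def find_captions(text):
    """Return figure captions, joining a caption with the blocks that continue it."""
    # Split on blank lines and collapse whitespace runs to single spaces.
    blocks = [" ".join(block.split()) for block in text.split("\n\n")]
    blocks = [block for block in blocks if block]

    captions = []
    i = 0
    while i < len(blocks):
        if CAPTION_START.match(blocks[i]):
            caption = blocks[i]
            # Heuristic: keep appending blocks until the caption ends a sentence.
            while not caption.endswith(SENTENCE_END) and i + 1 < len(blocks):
                i += 1
                caption += " " + blocks[i]
            captions.append(caption)
        i += 1
    return captions
```

Requiring a figure number after "Fig" also avoids matching body paragraphs that merely begin with a word like "Figuratively". The heuristic has an obvious weakness: a caption that never ends with sentence punctuation would absorb every block that follows it.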

Step 4 - Try Tesseract

I was curious how well OCR would do in this case. First convert the PDF to images, then install the following:

sudo apt install tesseract-ocr
pip install pyocr
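The PDF-to-image conversion itself is not shown in the answer. One possible sketch, assuming the third-party pdf2image package (`pip install pdf2image`, which also needs poppler installed) and its `convert_from_path` function; the helper name `pdf_to_images`, the `convert` hook, and the 300 dpi default are hypothetical choices:

```python
def pdf_to_images(pdf_path, dpi=300, convert=None):
    """Render each PDF page to page_<n>.jpg and return the file names.

    `convert` is a hypothetical hook for other renderers;
    by default pdf2image's convert_from_path is used.
    """
    if convert is None:
        # Lazy import: pdf2image is an assumption, not part of the original answer.
        from pdf2image import convert_from_path as convert
    pages = convert(pdf_path, dpi=dpi)  # one PIL image per page
    names = []
    for number, page in enumerate(pages, start=1):
        name = "page_%d.jpg" % number
        page.save(name, "JPEG")
        names.append(name)
    return names


# Example call with an assumed file name:
# pdf_to_images("example.pdf")
```

A higher resolution generally gives Tesseract more to work with on scanned pages, at the cost of larger files and slower OCR.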

This code performs OCR on the image.

from PIL import Image
import sys

import pyocr
import pyocr.builders

tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)

tool = tools[0]
print("Will use tool '%s'" % (tool.get_name()))

langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
lang = langs[0]
print("Will use lang '%s'" % (lang))

imageFile = "page_1.jpg"

txt = tool.image_to_string(
    Image.open(imageFile),
    lang=lang,
    builder=pyocr.builders.TextBuilder()
)
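# Write the recognized text to a file; the with block closes it afterwards
with open("page_1.txt", "w", encoding="utf-8") as out_file:
    out_file.write(txt)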

This produces better text blocks, but with some spelling mistakes:

CIRCUITS IN THE CEREBELLAR CONTROL OF MOVEMENT

By Joun C. Eccuss

AMA/ ERF INSTITUTE FOR BIOMEDICAL RESEARCH, CHICAGO

Communicated May 16, 1967

Neuroanatomists have generally recognized that the cerebellum provides the greatest challenge in our initial efforts to discern functional meaning in neuronal patterns because there is a stereotyped and simple geometrical arrangement of its unique neuronal constituents. Presumably, it is for this reason that there is the most refined knowledge of microstructure that is available in the central nervous system. The pioneer investigations of Ram6n y Cajal! have led in recent times to fascinating developments concerning microstructure, geometrical arrangements, and numerical assessment.’

As shown in Figure 1,* there are only two kinds of afferent fibers conveying information to the cerebellum, the climbing fibers (cf) and the mossy fibers (m/f); and there is only one type of efferent fiber from the cerebellum, the axons of the Purkinje cells (Pc), which terminate in the cerebellar nuclei (en) and otherwise largely in Deiters’ nucleus. The climbing fiber is uniquely distributed to a single

Fic. 1.—Perspective drawing by Fox? of a part of a folium of the cerebellar cortex. The principal components are shown in diagrammatic form, and are described in the text.

336
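Note that in the OCR output the caption begins with "Fic." rather than "Fig.", so a plain `^fig` test would miss it. A rough sketch of a more tolerant check using the standard library's `difflib`; the helper name `looks_like_caption` and the 0.6 threshold are assumptions chosen for illustration:

```python
import difflib
import re

# First run of letters in a block, followed by an optional "." and a figure number.
LABEL = re.compile(r"^\s*([A-Za-z]+)\.?\s*\d+")


def looks_like_caption(block, threshold=0.6):
    """Return True if the block starts with something close to 'Fig' or 'Figure'."""
    match = LABEL.match(block)
    if not match:
        return False
    word = match.group(1).lower()
    # Compare the leading word with both spellings and accept a close match.
    return any(
        difflib.SequenceMatcher(None, word, target).ratio() >= threshold
        for target in ("fig", "figure")
    )
```

This trades precision for recall: any short label that differs from "fig" by one letter and is followed by a number would also pass, so the threshold may need tuning for other documents.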
