How to find figure captions in a PDF?
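I want to develop a Python script that finds all the figure captions in a PDF. I'd like to know whether it is possible to collect the captions and append each one to an array as the script finds it.
I tried searching for the word "Figure" and then grabbing the whole sentence it appears in, but that doesn't work well: it only captures text up to the next period, so it misses the remaining sentences of a multi-sentence caption.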
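Edit
Below is the sample PDF I intend to use. As you can see, the text "Fig. 1" appears directly below the image.
New edit
Here is the complete HTML file converted with pdf2htmlEX: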
https://drive.google.com/open?id=1hYriVrTlwmxR35A2Jy7mKoO4ns2oWe3Z
This answer is incomplete; we will update it as we work through the problem.
Copy of the original PDF:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC335638/pdf/pnas00677-0355.pdf
Step 1 - Try PyPDF2
# Import the required module
import PyPDF2
# Open the PDF in binary mode; the with block closes the file automatically
with open('example.pdf', 'rb') as pdfFileObj:
    # Create a PDF reader object (PdfReader replaces the PdfFileReader name removed in PyPDF2 3.0)
    pdfReader = PyPDF2.PdfReader(pdfFileObj)
    # Print the number of pages in the PDF
    print(len(pdfReader.pages))
    # Get the first page
    pageObj = pdfReader.pages[0]
    # Extract and print the text of the page
    print(pageObj.extract_text())
This isn't suitable: the extracted text doesn't even have spaces between the words.
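Step 2 - Try pdf2htmlEX
It was suggested that we try https://github.com/coolwanglu/pdf2htmlEX to convert the PDF to HTML, and then develop suitable selectors for use with beautifulsoup4.
pdf2htmlEX generates HTML in which every word is wrapped in its own tag, which doesn't help us at all.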
Step 3 - Try pdfminer.six
https://github.com/pdfminer/pdfminer.six
Not perfect yet, but much better:
CIRCUITS IN THE CEREBELLAR CONTROL OF MOVEMENT
BY JOHN C. ECCLES
AMA/ERF INSTITUTE FOR BIOMEDICAL RESEARCH, CHICAGO
Communicated May 16, 1967
Neuroanatomists have generally recognized that the cerebellum provides
the greatest challenge in our initial efforts to discern functional
meaning in neuronal patterns because there is a stereotyped and simple
geometrical arrangement of its Presumably, it is for this reason that
there is the unique neuronal constituents. most refined knowledge of
microstructure that is available in the central nervous system. The
pioneer investigations of Ram6n y Cajall have led in recent times to
fascinating developments concerning microstructure, geometrical
arrangements, and numerical assessment.2
As shown in Figure 1,3 there are only two kinds of afferent fibers
conveying information to the cerebellum, the climbing fibers (cf) and
the mossy fibers (mf); and there is only one type of efferent fiber
from the cerebellum, the axons of the Purkinje cells (Pc), which
terminate in the cerebellar nuclei (cn) and otherwise largely in
Deiters' nucleus. The climbing fiber is uniquely distributed to a
single
FIG. 1.-Perspective drawing by Fox3 of a part of a folium of the
cerebellar cortex. The principal
components are shown in diagrammatic form, and are described in the
text.
336
VOL. 58, 1967
PHYSIOLOGY: J. C. ECCLES
337
We can then run this code on the output:
import re
# Read in the text extracted by pdfminer.six
fileName = "sample.txt"
with open(fileName, "r") as pdfTextFile:
    pdfText = pdfTextFile.read()
# Split the text into blocks separated by a blank line
blocks = pdfText.split("\n\n")
# Join the lines within each block with a space to remove arbitrary line breaks
# (joining with an empty string would glue the last and first words of adjacent lines together)
blocks = [block.replace("\n", " ").strip() for block in blocks]
# Keep the blocks that start with "fig" (case-insensitive)
captions = []
for block in blocks:
    if re.match(r"fig", block, re.IGNORECASE):
        captions.append(block)
# Print each caption followed by a blank line
for caption in captions:
    print(caption)
    print()
This may need more tuning, because the output of pdfminer.six isn't perfect.
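One possible refinement, sketched here under assumptions, is to wrap the block search in a function that returns the captions as a list, which is what the question asks for. The function name find_captions, the pattern CAPTION_RE, and the inline sample string are all hypothetical; the sketch assumes that each caption is its own blank-line-separated block and starts with "Fig", "Fig." or "Figure" followed by a number.

```python
import re

# Assumed caption opening: "Fig", "Fig." or "Figure" (any case), then a figure number.
CAPTION_RE = re.compile(r"fig(?:ure)?\.?\s*(\d+)", re.IGNORECASE)

def find_captions(text):
    """Return a list of (figure_number, caption_text) tuples found in the text."""
    captions = []
    for block in text.split("\n\n"):
        # Collapse line breaks and repeated whitespace inside the block
        block = " ".join(block.split())
        # match() only succeeds at the start of the block, so in-text
        # references such as "As shown in Figure 1" are skipped
        match = CAPTION_RE.match(block)
        if match:
            captions.append((int(match.group(1)), block))
    return captions

# Small sample modelled on the pdfminer.six output shown above
sample = (
    "As shown in Figure 1,3 there are only two kinds of afferent fibers\n"
    "conveying information to the cerebellum.\n"
    "\n"
    "FIG. 1.-Perspective drawing by Fox3 of a part of a folium of the\n"
    "cerebellar cortex. The principal\n"
    "components are shown in diagrammatic form, and are described in the\n"
    "text.\n"
    "\n"
    "336\n"
)

for number, caption in find_captions(sample):
    print(number, caption)
```

Because the whole block is kept, a caption with several sentences is returned in full rather than being cut at the first period.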
Step 4 - Try Tesseract
I was curious how well OCR would do in this case. First convert the PDF to images, then install the following:
sudo apt install tesseract-ocr
pip install pyocr
This code performs OCR on an image.
from PIL import Image
import sys
import pyocr
import pyocr.builders
# Find the OCR tools installed on this machine (for example Tesseract)
tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
tool = tools[0]
print("Will use tool '%s'" % (tool.get_name()))
# Use the first language reported by the tool
langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
lang = langs[0]
print("Will use lang '%s'" % (lang))
# Run OCR on one page image and save the recognized text
imageFile = "page_1.jpg"
txt = tool.image_to_string(
    Image.open(imageFile),
    lang=lang,
    builder=pyocr.builders.TextBuilder()
)
with open("page_1.txt", "w") as outFile:
    outFile.write(txt)
This produces better text blocks, but with some misspellings:
CIRCUITS IN THE CEREBELLAR CONTROL OF MOVEMENT
By Joun C. Eccuss
AMA/ ERF INSTITUTE FOR BIOMEDICAL RESEARCH, CHICAGO
Communicated May 16, 1967
Neuroanatomists have generally recognized that the cerebellum provides
the greatest challenge in our initial efforts to discern functional
meaning in neuronal patterns because there is a stereotyped and simple
geometrical arrangement of its unique neuronal constituents.
Presumably, it is for this reason that there is the most refined
knowledge of microstructure that is available in the central nervous
system. The pioneer investigations of Ram6n y Cajal! have led in
recent times to fascinating developments concerning microstructure,
geometrical arrangements, and numerical assessment.’
As shown in Figure 1,* there are only two kinds of afferent fibers
conveying information to the cerebellum, the climbing fibers (cf) and
the mossy fibers (m/f); and there is only one type of efferent fiber
from the cerebellum, the axons of the Purkinje cells (Pc), which
terminate in the cerebellar nuclei (en) and otherwise largely in
Deiters’ nucleus. The climbing fiber is uniquely distributed to a
single
Fic. 1.—Perspective drawing by Fox? of a part of a folium of the
cerebellar cortex. The principal components are shown in diagrammatic
form, and are described in the text.
336
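Because OCR can misread the label itself (the caption above starts with "Fic." rather than "Fig."), an exact match on "fig" would miss it. A minimal sketch of a more tolerant check follows, using difflib from the standard library. The function name looks_like_caption, the pattern CAPTION_START, and the 0.6 threshold are assumptions chosen for illustration, not tuned values, and the sketch assumes a caption opens with a short word, an optional period, and a number.

```python
import re
from difflib import SequenceMatcher

# Assumed caption opening: a word, an optional period, optional spaces, a number.
CAPTION_START = re.compile(r"([A-Za-z]+)\.?\s*(\d+)")

def looks_like_caption(block, threshold=0.6):
    """Return True if the block opens with a word close to 'fig' or 'figure' plus a number."""
    match = CAPTION_START.match(block.strip())
    if match is None:
        return False
    label = match.group(1).lower()
    # Compare the leading word against both spellings and accept a near match
    return any(
        SequenceMatcher(None, label, target).ratio() >= threshold
        for target in ("fig", "figure")
    )

blocks = [
    "As shown in Figure 1,* there are only two kinds of afferent fibers",
    "Fic. 1. Perspective drawing by Fox? of a part of a folium",
    "336",
]
captions = [block for block in blocks if looks_like_caption(block)]
print(captions)
```

Requiring a number after the label keeps ordinary words that happen to resemble "fig" from being matched, though a low threshold can still produce false positives on other documents.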