cv2 to tesseract directly without saving

Question

import pytesseract
from pdf2image import convert_from_path, convert_from_bytes
import cv2,numpy
def pil_to_cv2(image):
    open_cv_image = numpy.array(image)
    return open_cv_image[:, :, ::-1].copy() 


path='OriginalsFile.pdf'
images = convert_from_path(path)
cv_h=[pil_to_cv2(i) for i in images]
img_header = cv_h[0][:160,:]
# print(pytesseract.image_to_string(Image.open('test.png')))  # the only usage I found in the pytesseract docs

Hello, is there a way to have pytesseract read img_header directly, without saving it to disk first?

pytesseract docs

Answer 1

Input format for pytesseract.image_to_string()

As the documentation states, pytesseract.image_to_string() takes a PIL image as input. So you can simply convert your cv2 image to a PIL image, like this:

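from PIL import Image
# ... (your code)
# img_header is in BGR order after pil_to_cv2; convert it back to RGB for PIL
print(pytesseract.image_to_string(Image.fromarray(cv2.cvtColor(img_header, cv2.COLOR_BGR2RGB))))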
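The conversion above can be wrapped in a small helper. This is only a sketch: the names bgr_to_pil and ocr_bgr are made up for illustration, and it assumes a 3-channel uint8 array in OpenCV's BGR order (which is what pil_to_cv2 in the question produces). Image.fromarray treats such an array as RGB, so the channels are reversed first. Newer pytesseract releases also document passing a NumPy array directly (assumed to be RGB); that route is not used here.

```python
import numpy as np
from PIL import Image


def bgr_to_pil(bgr):
    """Wrap an OpenCV-style BGR array in a PIL image, entirely in memory."""
    # Reverse the channel axis (BGR -> RGB) and make the result contiguous.
    rgb = np.ascontiguousarray(bgr[:, :, ::-1])
    return Image.fromarray(rgb)


def ocr_bgr(bgr):
    """Run OCR on a BGR array without writing an image file ourselves."""
    # Imported lazily so bgr_to_pil can be used without pytesseract installed.
    import pytesseract
    return pytesseract.image_to_string(bgr_to_pil(bgr))
```

With the question's variables this would be called as ocr_bgr(img_header).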

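If you really don't want to use PIL

See: https://github.com/madmaze/pytesseract/blob/master/src/pytesseract.py

pytesseract is a thin wrapper that runs the tesseract command. In def run_and_get_output() you will see that it saves the image to a temporary file and then passes that file's path to tesseract to run on.

So you could do the same thing with OpenCV: just rewrite pytesseract's only .py file to do this with opencv. I don't see any performance gain from doing so, though.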
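The temp-file approach described above could be sketched roughly as follows. This is not pytesseract's actual code. It is a minimal stand-in that assumes the tesseract executable is on PATH and uses the CLI form `tesseract <image> stdout`. The function names build_tesseract_command and ocr_via_temp_file are hypothetical.

```python
import os
import subprocess
import tempfile


def build_tesseract_command(image_path, lang=None):
    """Build the argument list for `tesseract <image> stdout [-l LANG]`."""
    cmd = ["tesseract", image_path, "stdout"]
    if lang:
        cmd += ["-l", lang]
    return cmd


def ocr_via_temp_file(bgr, lang=None):
    """Write a cv2 image to a temporary PNG, run tesseract on it, clean up."""
    import cv2  # imported lazily; only needed for the actual write

    fd, tmp_path = tempfile.mkstemp(suffix=".png")
    os.close(fd)  # cv2.imwrite reopens the file by name
    try:
        if not cv2.imwrite(tmp_path, bgr):
            raise RuntimeError("cv2.imwrite failed")
        result = subprocess.run(
            build_tesseract_command(tmp_path, lang),
            capture_output=True, encoding="utf-8", check=True,
        )
        return result.stdout
    finally:
        os.remove(tmp_path)  # always remove the temporary image
```

Note that the image still touches the disk here, just as it does inside pytesseract, which is why this route is unlikely to be faster than going through PIL.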

Answer 2

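The fromarray function lets you pass the image to tesseract as a PIL image without saving it to disk, but you should also make sure you don't send a list of PIL images to tesseract. convert_from_path returns a list of PIL images, one per page, so if the PDF contains multiple pages you need to send each page to tesseract individually.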

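import pytesseract
from pdf2image import convert_from_path
from PIL import Image
import cv2, numpy

def pil_to_cv2(image):
    # PIL gives RGB; reverse the channel axis to get OpenCV's BGR order
    open_cv_image = numpy.array(image)
    return open_cv_image[:, :, ::-1].copy()

path = 'OriginalsFile.pdf'  # same file as in the question
doc = convert_from_path(path)

# convert_from_path returns one PIL image per page; OCR each page separately
for page_number, page_data in enumerate(doc):
    cv_h = pil_to_cv2(page_data)
    img_header = cv_h[:160, :]  # top 160 pixel rows of the page
    rgb_header = cv2.cvtColor(img_header, cv2.COLOR_BGR2RGB)  # PIL expects RGB
    print(f"{page_number} - {pytesseract.image_to_string(Image.fromarray(rgb_header))}")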
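If the only OpenCV step is cropping the header, the round trip through cv2 can probably be skipped, since pdf2image already returns PIL images. The sketch below is an alternative, not what the answer does. The helper names crop_header and ocr_headers are hypothetical, and the 160-pixel default is taken from the question's slice.

```python
from PIL import Image


def crop_header(page, height=160):
    """Return the top `height` pixel rows of a PIL page image."""
    # The crop box is (left, upper, right, lower); clamp to the page height.
    return page.crop((0, 0, page.width, min(height, page.height)))


def ocr_headers(pages, height=160):
    """Yield (page_number, text) for the header strip of each page."""
    # Imported lazily so crop_header can be used without pytesseract installed.
    import pytesseract
    for page_number, page in enumerate(pages):
        yield page_number, pytesseract.image_to_string(crop_header(page, height))
```

A call such as ocr_headers(convert_from_path(path)) would then process every page of the PDF one at a time.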
