Tesseract OCR does not work on images whose text is 2 characters or shorter; it works fine on images with more than 3 characters

Question

import pytesseract
from PIL import Image

def textFromTesseractOCR(croppedImage):
    # Try every page segmentation mode (0-13) with the default OCR engine (--oem 3).
    for i in range(14):
        # `boxes` argument removed: current pytesseract's image_to_string does not take it
        text = pytesseract.image_to_string(croppedImage, lang='eng', config='--psm ' + str(i) + ' --oem 3')
        print("PSM Mode", i)
        print("Text detected: ", text)

imgPath = "ImagePath"   # you can use the image I have uploaded
img = Image.open(imgPath)

textFromTesseractOCR(img)

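I am trying to extract table data from a PDF. To do this, I convert the PDF to PNG, detect lines, determine the table from the line intersections, and then crop each individual cell to get its text.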
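For readers following along, the cell-cropping step described above can be sketched roughly as below. This is only an illustration, not the code used in the question: `crop_cells`, `row_lines`, and `col_lines` are hypothetical names, and the line positions are assumed to have been detected already.

```python
import numpy as np

def crop_cells(page, row_lines, col_lines):
    """Slice a page image into table cells.

    row_lines / col_lines are hypothetical inputs: sorted pixel positions of
    the detected horizontal and vertical ruling lines.
    """
    cells = []
    for top, bottom in zip(row_lines[:-1], row_lines[1:]):
        # One list of cell crops per table row
        row = [page[top:bottom, left:right]
               for left, right in zip(col_lines[:-1], col_lines[1:])]
        cells.append(row)
    return cells

# Toy 6x8 "page" with a 2x2 table whose lines sit at the given positions
page = np.arange(6 * 8).reshape(6, 8)
cells = crop_cells(page, row_lines=[0, 3, 6], col_lines=[0, 4, 8])
print(len(cells), len(cells[0]), cells[1][1].shape)
```

Each entry of `cells` is a view into the page array and can be passed to the OCR function one cell at a time.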

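Everything works fine, but tesseract fails on cell images whose text is 2 characters or shorter.

Works for this image:

Result from tesseract:

Does not work for this image:

Result from tesseract: it returns an empty string. It also returns an empty string for numbers that are 2 digits or shorter.

I have tried resizing the image (which I knew would not work), and I have also tried appending dummy text to the image, but the results were poor (it worked for only a few images, and I do not know the exact position in the image at which to append the dummy text).

It would be great if someone could help me.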

Answer 1

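I tried running tesseract on the 2 given images, but it does not return the text in the image with the shorter text.

Another thing you can try is to train a machine learning model (probably a neural net) on letters, digits, and special characters. Then, when you want to get text from an image, feed that image to the model and it will predict the text/characters.

The training dataset would look like:

Pairs of (character image, 'character').

The first element of each pair is the independent variable for the model. The second element is the corresponding character present in that image; it is the dependent variable for the model.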
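To make the dataset layout concrete, here is a minimal sketch. It makes several assumptions: the 5x3 bitmaps in `GLYPHS` are made-up stand-ins for real cropped character images, and a 1-nearest-neighbour lookup is swapped in for the neural net suggested above, purely to keep the example short.

```python
import numpy as np

# Hypothetical 5x3 bitmaps standing in for cropped character images
GLYPHS = {
    "0": ["###", "#.#", "#.#", "#.#", "###"],
    "1": [".#.", "##.", ".#.", ".#.", "###"],
    "7": ["###", "..#", ".#.", ".#.", ".#."],
}

def to_array(rows):
    """Convert a list of '#'/'.' strings into a float image array."""
    return np.array([[c == "#" for c in r] for r in rows], dtype=float)

# Training set: pairs of (character image, 'character')
pairs = [(to_array(rows), label) for label, rows in GLYPHS.items()]
X = np.stack([img.ravel() for img, _ in pairs])  # independent variable
y = [label for _, label in pairs]                # dependent variable

def predict(image):
    """Return the label of the closest training image (1-nearest-neighbour)."""
    distances = np.linalg.norm(X - image.ravel(), axis=1)
    return y[int(np.argmin(distances))]

noisy = to_array(GLYPHS["7"])
noisy[4, 0] = 1.0  # flip one pixel to simulate noise
print(predict(noisy))
```

A real model would need many samples per character (different fonts, sizes, noise), but the (image, label) pairing stays the same.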

Answer 2

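So I finally found a workaround for this situation: tesseract-OCR returning an empty string when the image contains only a string of length 1 or 2 (for example "1" or "25").

To get output in this case, I appended the same image to the original several times so that the text becomes longer than 2 characters. For example, if the original image contains only "3", I appended the "3" image (the same image) 4 more times, producing an image containing the text "33333". We then give this image to tesseract, which outputs "33333" (most of the time). After that we only need to remove the spaces from the text Tesseract outputs and divide the resulting string length by 5; that gives the index up to which we take characters from the full text.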
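The tiling and the "divide by 5" step can be sketched in isolation, without running OCR. The helper names `tile_horizontally` and `collapse_repeated_text` are hypothetical, and unlike the full code in this answer the sketch drops all whitespace (including newlines), not only spaces.

```python
import numpy as np

REPEATS = 5  # the original crop plus 4 appended copies

def tile_horizontally(image, repeats=REPEATS):
    """Place `repeats` copies of a 2-D grayscale image side by side."""
    return np.concatenate([image] * repeats, axis=1)

def collapse_repeated_text(ocr_text, repeats=REPEATS):
    """Recover the original string from the OCR output of the tiled image."""
    cleaned = "".join(ocr_text.split())  # drop spaces and newlines
    return cleaned[: len(cleaned) // repeats]

tile = np.zeros((4, 3), dtype=np.uint8)
print(tile_horizontally(tile).shape)
print(collapse_repeated_text("25 25 25 25 25\n"))
```

Note that the slicing assumes Tesseract read all five copies. If it drops characters, the integer division shortens the result: "3333" has length 4, and 4 // 5 is 0, which yields an empty string.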

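See the code for reference; I hope it helps:

import cv2           ## pip3 install opencv-python
import numpy as np   ## pip3 install numpy
import pytesseract   ## pip3 install pytesseract

Method that calls tesseract for OCR and, if tesseract returns an empty string, falls back to our workaround code.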

def textFromTesseractOCR(croppedImage):
    # croppedImage is expected to be a grayscale numpy array
    # (cv2.threshold and the `0 not in` test below need one)
    text = pytesseract.image_to_string(croppedImage)
    if text.strip() == '':    ### program that handles our problem
        if 0 not in croppedImage:
            # no pure-black pixel, so there is nothing to read
            return ""
        yDir = 3
        xDir = 3
        iterations = 4
        img = generate_blocks_dilation(croppedImage, yDir, xDir, iterations)
        ## we dilate to get only the text portion of the image and not the whole image
        kernelH = np.ones((1,5),np.uint8)
        kernelV = np.ones((5,1),np.uint8)
        img = cv2.dilate(img,kernelH,iterations = 1)
        img = cv2.dilate(img,kernelV,iterations = 1)
        image = cropOutMyImg(img, croppedImage)
        ## place 5 copies of the text crop side by side
        concateImg = np.concatenate([image] * 5, axis = 1)
        textA = pytesseract.image_to_string(concateImg)
        textA = textA.strip()
        textA = textA.replace(" ","")
        ## keep the first fifth of the output, i.e. one copy of the text
        textA = textA[0:len(textA) // 5]
        return textA
    return text

Dilation method. This method is used to dilate only the text region of the image.

def generate_blocks_dilation(img, yDir, xDir, iterations):
    kernel = np.ones((yDir,xDir),np.uint8)
    ret,img = cv2.threshold(img, 0, 1, cv2.THRESH_BINARY_INV)
    return cv2.dilate(img,kernel,iterations = iterations)
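To see what the inverse threshold plus dilation does, here is a small NumPy-only illustration. It is a stand-in, not OpenCV itself: `text_mask` mimics the threshold call above for uint8 input (only pure-black pixels are marked), and the hypothetical `dilate` helper imitates `cv2.dilate` with an all-ones kernel of odd size.

```python
import numpy as np

def text_mask(gray):
    """Mark pure-black pixels with 1, everything else with 0."""
    return (gray == 0).astype(np.uint8)

def dilate(mask, k_h, k_w, iterations):
    """NumPy stand-in for cv2.dilate with a k_h x k_w all-ones kernel."""
    ph, pw = k_h // 2, k_w // 2
    out = mask.copy()
    h, w = out.shape
    for _ in range(iterations):
        padded = np.pad(out, ((ph, ph), (pw, pw)))  # zero border
        # A pixel becomes 1 if any pixel under the kernel is 1
        out = np.max([padded[dy:dy + h, dx:dx + w]
                      for dy in range(k_h) for dx in range(k_w)], axis=0)
    return out

gray = np.full((9, 9), 255, dtype=np.uint8)
gray[4, 4] = 0  # a single black "text" pixel
grown = dilate(text_mask(gray), 3, 3, iterations=2)
print(grown.sum())
```

With a 3x3 kernel each iteration grows the marked region by one pixel on every side, so the 4 iterations used in the answer merge nearby strokes into one block around the text.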

Method that crops out the dilated part of the image.

def cropOutMyImg(gray, OrigImg):
    mask = np.zeros(gray.shape,np.uint8) # mask image the final image without small pieces
    # [-2] picks the contour list on both OpenCV 3 (3 return values) and OpenCV 4 (2 return values)
    contours = cv2.findContours(gray,cv2.RETR_LIST,cv2.CHAIN_APPROX_SIMPLE)[-2]
    QR_final = OrigImg  # fallback (added) if no contour with non-zero area is found
    for cnt in contours:
        if cv2.contourArea(cnt)!=0:
            cv2.drawContours(mask,[cnt],0,255,-1) # the [] around cnt and 3rd argument 0 mean only the particular contour is drawn
            # Build a ROI to crop the text region
            x,y,w,h = cv2.boundingRect(cnt)
            roi=mask[y:y+h,x:x+w]
            # crop the original image based on the ROI
            QR_crop = OrigImg[y:y+h,x:x+w]
            # use cropped mask image (roi) to get rid of all small pieces;
            # roi/255 is a float array, so cast the product back to 8-bit
            QR_final = (QR_crop * (roi/255)).astype(np.uint8)
    return QR_final


