Extract Gujarati (Language Supported by Google Input Tools) Text from Newspaper Articles

Question

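I want to extract Gujarati text from newspaper articles (photos or digital copies).

Currently I crop each article into small pieces by hand, because most tools extract text horizontally across the whole page, which does not work with the multi-column layout of newspaper articles.

Then I merge all the pieces vertically into one image and upload it to Google Drive.

Next I open the image with Google Docs, which gives me the image together with highly accurate text (because Google Input Tools support Gujarati).

I am trying to automate all of the above, so that I only provide the newspaper article as input and get the final text as output.

I have heard of Python automation scripts, but I do not know how to use them.

So, in the end, I need to perform two tasks in sequence: (1) identify the blocks of a newspaper article in reading order, and (2) convert the images to text.

Here is a sample image of an article:

How can I speed up this task?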

Answer 1

First, you need to get familiar with OpenCV. Here is the basic idea to start from:

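import cv2

# Load the image and convert it to grayscale
image = cv2.imread('news.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Convert to binary; with THRESH_OTSU the threshold is picked automatically,
# so the 150 passed here is ignored
thresh, binary = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
cv2.imshow('binary', binary)

# Find external contours on the inverted image (dark text becomes white).
# [-2] selects the contour list in both OpenCV 3.x (3 return values) and 4.x (2).
contours = cv2.findContours(~binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]

# Draw a rectangle around each contour on the original image
for contour in contours:
    x, y, w, h = cv2.boundingRect(contour)
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 1)
cv2.imshow('contour', image)

# Keep the windows open until a key is pressed
cv2.waitKey(0)
cv2.destroyAllWindows()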
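
The question also asks for the blocks "in order". One possible way to get there, once you have bounding boxes in the `(x, y, w, h)` form that `cv2.boundingRect` returns, is to group them into columns and read each column top to bottom. The sketch below is only an illustration: the function name `sort_boxes_in_reading_order` and the `min_gap` parameter are made up for this example, and it assumes the boxes already describe whole text blocks (for instance after merging characters by dilating the binary image) and that columns do not overlap horizontally. A headline spanning several columns would break that assumption.

```python
def sort_boxes_in_reading_order(boxes, min_gap=0):
    """Order (x, y, w, h) boxes column by column, top to bottom in each column.

    min_gap is the smallest horizontal gap (in pixels) that is treated as a
    column separator; it is a hypothetical tuning knob for this sketch.
    """
    columns = []  # each entry is [right_edge, [boxes in this column]]
    for box in sorted(boxes, key=lambda b: b[0]):  # left to right
        x, y, w, h = box
        if columns and x - columns[-1][0] < min_gap:
            # Starts before the current column ends (or too close to it):
            # treat it as part of the same column
            columns[-1][1].append(box)
            columns[-1][0] = max(columns[-1][0], x + w)
        else:
            # Starts to the right of the current column: open a new column
            columns.append([x + w, [box]])

    ordered = []
    for _, column_boxes in columns:
        ordered.extend(sorted(column_boxes, key=lambda b: b[1]))  # top to bottom
    return ordered


# Two columns with two blocks each, given in scrambled order
example = [(210, 10, 180, 100), (10, 150, 180, 100),
           (10, 10, 180, 100), (210, 150, 180, 100)]
print(sort_boxes_in_reading_order(example))
```

Real newspaper layouts are messier than this, so treat the grouping rule as a starting point to adapt, not a finished layout analysis.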

After that, read up on Python-tesseract (pytesseract), an optical character recognition (OCR) tool for Python.
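
As a rough sketch of the image-to-text step, the snippet below crops each block out of the page and passes it to `pytesseract.image_to_string` with `lang="guj"`, the Tesseract language code for Gujarati. It assumes the Tesseract binary and its Gujarati language data are installed, and that `image` is a NumPy array such as the one returned by `cv2.imread`. The function name `ocr_blocks` and the `recognize` parameter are invented for this example; `recognize` is only there so a different OCR backend can be swapped in. Accuracy on Gujarati newsprint may well differ from what Google Docs gives you, so check it on your own scans.

```python
def ocr_blocks(image, boxes, recognize=None, lang="guj"):
    """Run OCR on each (x, y, w, h) block of image and join the results.

    recognize is a hypothetical hook: any callable that takes a cropped
    image and returns its text. By default pytesseract is used.
    """
    if recognize is None:
        # Imported lazily so the function can be used with another backend
        # on machines where pytesseract is not installed
        import pytesseract

        def recognize(crop):
            return pytesseract.image_to_string(crop, lang=lang)

    texts = []
    for x, y, w, h in boxes:
        crop = image[y:y + h, x:x + w]  # rows first, then columns
        texts.append(recognize(crop).strip())

    # Separate blocks with a blank line and skip blocks with no text
    return "\n\n".join(text for text in texts if text)
```

With the earlier OpenCV code, a call could look like `ocr_blocks(gray, boxes)`, where `boxes` is the list of block rectangles in the order you want them read.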

Here are some resources that may help:

article-extraction-from-newspaper-image-in-python-and-opencv
finding-blocks-of-text-in-an-image-using-python-opencv-and-numpy
opencv-ocr-and-text-recognition-with-tesseract

python

text

extract

google-docs

google-docs-api