Extracting .docx data, images and structure

Question

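Good day,

I have a task that requires extracting specific parts of a document template (for automation purposes). While iterating over the document I can tell where I currently am (by checking regular expressions, keywords, etc.), but I am unable to extract:

The structure of the document
Images that sit between pieces of text

Is it possible to get, for example, an array describing the document's structure, like the one below?

['Paragraph1','Paragraph2','Image1','Image2','Paragraph3','Paragraph4','Image3','Image4']

My current implementation looks like this: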

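from docx import Document

# Document is imported directly, so call it without the docx. prefix
document = Document('demo.docx')

text = []

# Collect the text of every non-empty paragraph
for x in document.paragraphs:
    if x.text != '':
        text.append(x.text)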

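With the code above I can get all of the text from the document, but I cannot detect the type of each piece of text (Header or Normal), and I cannot detect any images. I am currently using python-docx.

My main problem is finding where the images sit in the document (i.e. between which paragraphs), so that I can recreate another document from the extracted text and images. The task requires me to know where each image appears in the original document, and therefore where to insert it in the new one.

Any help is greatly appreciated, thanks :)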

Answer 1

To extract the structure of paragraphs and headings, you can use the built-in objects in python-docx. Take a look at this code.

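from docx import Document

# Document is imported directly, so call it without the docx. prefix
document = Document('demo.docx')
text = []
style = []
# Keep the two lists aligned: style[i] is the style name of text[i]
for x in document.paragraphs:
    if x.text != '':
        style.append(x.style.name)
        text.append(x.text)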

x.style.name gives you the name of the style applied to each paragraph.
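To map those style names onto the Header / Normal distinction from the question, a minimal sketch could look like the following. The helper name label_paragraphs is hypothetical, and treating every style whose name starts with "Heading" (plus "Title") as a header is an assumption about the template, not something python-docx decides for you.

```python
def label_paragraphs(paragraphs):
    """Return (kind, style_name, text) for every non-empty paragraph.

    `paragraphs` is any iterable of objects exposing .text and .style.name,
    such as the paragraphs of a python-docx document.
    """
    labelled = []
    for para in paragraphs:
        if not para.text.strip():
            continue  # skip empty or whitespace-only paragraphs
        name = para.style.name
        # Assumption: "Heading 1", "Heading 2", ... and "Title" count as headers;
        # adjust this rule if the template uses custom style names.
        is_header = name.startswith("Heading") or name == "Title"
        labelled.append(("Header" if is_header else "Normal", name, para.text))
    return labelled
```

With the document object from the code above, this would be called as label_paragraphs(document.paragraphs).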

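python-docx has no high-level API that tells you where images sit in the text, so for that you need to parse the XML. To see which tags the document contains, print them: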

# iter() replaces the deprecated getiterator() and walks the tree in document order
for elem in document.element.iter():
    print(elem.tag)

Let me know if you need anything else.

To extract the image names and their positions, use this.

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PIC = '{http://schemas.openxmlformats.org/drawingml/2006/picture}'

tags = []
text = []
# iter() visits elements in document order, so both lists keep the original sequence
for t in document.element.iter():
    if t.tag == PIC + 'cNvPr':
        # pic:cNvPr carries the picture's name attribute
        print('Picture Found: ', t.attrib['name'])
        tags.append('Picture')
        text.append(t.attrib['name'])
    elif t.tag in (W + 'r', W + 't', W + 'drawing') and t.text:
        tags.append('text')
        text.append(t.text)

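You can then look up the text before and after any entry in the text list, and the kind of each entry ('Picture' or 'text') at the same index in the tags list.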
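For the ordered array asked for in the question, here is a rough sketch of the same idea grouped by paragraph. Note that it swaps in the standard library (zipfile and xml.etree.ElementTree) instead of python-docx, reading word/document.xml straight from the .docx archive. The function name document_structure is hypothetical, and the sketch assumes that pictures are stored as DrawingML pic:cNvPr elements and that only top-level body paragraphs matter (tables are skipped).

```python
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
PIC = "{http://schemas.openxmlformats.org/drawingml/2006/picture}"


def document_structure(docx_file):
    """Return ('Paragraph', text) and ('Image', name) tuples in document order.

    `docx_file` is a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    structure = []
    body = root.find(W + "body")
    for child in body:
        if child.tag != W + "p":
            continue  # assumption: tables and section properties are ignored
        buffer = []
        for node in child.iter():
            if node.tag == W + "t" and node.text:
                buffer.append(node.text)
            elif node.tag == PIC + "cNvPr":
                # Flush text seen so far so the image keeps its position
                if buffer:
                    structure.append(("Paragraph", "".join(buffer)))
                    buffer = []
                structure.append(("Image", node.get("name", "")))
        if buffer:
            structure.append(("Paragraph", "".join(buffer)))
    return structure
```

Calling document_structure('demo.docx') returns one tuple per paragraph or picture, and the order of the list is the order in which to re-insert text and images into the new document.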


python

python-docx