How to extract an image from a table in an MS Word document with the docx library?

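I am developing a program that needs to extract two images from an MS Word document so they can be used in another document. I know where the images are (the first table in the document), but when I try to extract any information from the table, even plain text, I get empty cells.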

Here is the Word document I want to extract the images from. I want to extract the 'Rentel' images from the first page (first table, rows 0 and 1, column 2).


I tried the following code:

from docxtpl import DocxTemplate

source_document = DocxTemplate("Source document.docx")

# It doesn't really matter which rows or columns I use for the cells, everything is empty
print(source_document.tables[0].cell(0,0).text)

This just gives me empty lines...


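I have read elsewhere that the problem may be that the table is "contained in a wrapper element that Python Docx cannot read". The suggestion was to change the source document, but I want to be able to select any document previously created with the same template as the source document (those documents have the same problem, and I cannot change each of them individually). So a Python-only solution is really the only way I can think of to solve this.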


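Because I only want those two specific images, unzipping the Word file and pulling arbitrary images out of the XML does not suit my case, unless I know which image name I need to extract from the unzipped Word folder.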


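I really hope to get this working, since it is part of my thesis (and I am only an electromechanical engineer, so I do not know much about software).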


[Edit]: Here is the XML code for the first image (source_document.tables[0].cell(0,2)._tc.xml) and here it is for the second image (source_document.tables[0].cell(1,2)._tc.xml). I noticed, however, that taking (0,2) as the row and column values gives me all the rows in column 2 within the first "visible" table. Cell (1,2) gives me all the rows in column 2 within the second "visible" table.

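If this cannot be solved directly with Python Docx, is it possible to search the XML code for the image name, ID, or something similar, and then add the image with Python Docx using that ID/name?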

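Well, the first thing that jumps out is that each of the two cells you posted (w:tc elements) contains a nested table. This is perhaps unusual, but it is certainly a valid composition. Maybe they did that so they could put a caption in a cell below the picture, or something like that.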

To access the nested table, you would have to do something like this:

outer_cell = source_document.tables[0].cell(0,2)
nested_table = outer_cell.tables[0]
inner_cell_1 = nested_table.cell(0, 0)
print(inner_cell_1.text)
# ---etc....---

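I'm not sure that solves your whole problem, but it strikes me that this is in the end two or more questions, the first being "Why isn't my table cell showing up?" and the second perhaps being "How do I get an image out of a table cell?" (once you have actually found the cell in question).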
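
When it is not obvious which cells hold nested tables, it can help to walk the whole structure and print what each level contains. The sketch below is a hypothetical helper (iter_cells is not part of python-docx); it only assumes objects that expose .rows, .cells and .tables the way python-docx tables and cells do, so treat it as an illustration rather than a tested recipe for your document.

```python
def iter_cells(table, path=()):
    """Yield (path, cell) for every cell in *table*, descending into nested tables.

    *path* is a tuple of (row_index, column_index) pairs, one pair per
    nesting level, so ((0, 2), (1, 0)) means: outer cell (0, 2), then
    cell (1, 0) of a table nested inside it.
    """
    for r, row in enumerate(table.rows):
        for c, cell in enumerate(row.cells):
            here = path + ((r, c),)
            yield here, cell
            # A cell may itself contain tables; recurse into each of them.
            for nested in cell.tables:
                yield from iter_cells(nested, here)


# Hypothetical usage with the document from the question (not executed here):
#   for path, cell in iter_cells(source_document.tables[0]):
#       print(path, repr(cell.text))
```

The text of an outer cell is built from the paragraphs directly inside it, so content that lives only in a nested table would not show up there, which would be consistent with the empty output in the question.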

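For anyone who runs into the same problem, this is the code that helped me solve it:

First, I extract the nested cell from the table with the following method: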

@staticmethod
def get_nested_cell(table, outer_row, outer_column, inner_row, inner_column):
    """
        Returns the nested cell (table inside a table) of the *document*

        :argument
            table: [docx.Table] outer table from which to get the nested table
            outer_row: [int] row of the outer table in which the nested table is
            outer_column: [int] column of the outer table in which the nested table is
            inner_row: [int] row in the nested table from which to get the nested cell
            inner_column: [int] column in the nested table from which to get the nested cell
        :return
            inner_cell: [docx.Cell] nested cell
    """
    # Get the outer cell, then the table nested inside it
    outer_cell = table.cell(outer_row, outer_column)
    nested_table = outer_cell.tables[0]
    inner_cell = nested_table.cell(inner_row, inner_column)

    return inner_cell

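With this cell I can get the XML code and extract the image from it. Notes:

  • I did not set the width and height of the picture, because I want it to keep the same size.
  • In the replace_logos_from_source method, I know that the table I want to get the logos from is 'tables[0]' and that the nested table is at outer_row and outer_column '0', so I just filled those values in when calling get_nested_cell instead of adding extra parameters to replace_logos_from_source.
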
def replace_logos_from_source(self, source_document, target_document, inner_row, inner_column):
    """
        Replace the employer and client logo from the *source_document* to the *target_document*. Since the table
        in which the logos are placed are nested tables, the source and target cells with *inner_row* and
        *inner_column* are first extracted from the nested table.

        :argument
            source_document: [DocxTemplate] document from which to extract the image
            target_document: [DocxTemplate] document to which to add the extracted image
            inner_row: [int] row in the nested table from which to get the image
            inner_column: [int] column in the nested table from which to get the image
        :return
            Nothing
    """
    # Get the target and source cell (I know that the table where I want to get the logos from is 'tables[0]' and that the nested table is in outer_row and outer_column '0', so I just filled it in without adding extra arguments to the method)
    target_cell = self.get_nested_cell(target_document.tables[0], 0, 0, inner_row, inner_column)
    source_cell = self.get_nested_cell(source_document.tables[0], 0, 0, inner_row, inner_column)

    # Get the xml code of the inner cell
    inner_cell_xml = source_cell._tc.xml

    # Get the image from the xml code
    image_stream = self.get_image_from_xml(source_document, inner_cell_xml)

    # Add the image to the target cell
    paragraph = target_cell.paragraphs[0]
    if image_stream:  # If not None (image exists)
        run = paragraph.add_run()
        run.add_picture(image_stream)
    else:
        # Set the target cell text equal to the source cell text
        paragraph.add_run(source_cell.text)

@staticmethod
def get_image_from_xml(source_document, xml_code):
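    """
        Returns the first image found in the *xml_code* as a stream

        :argument
            source_document: [DocxTemplate] document that owns the image part
            xml_code: [string] xml code from which to extract the image
        :return
            image_stream: [BytesIO stream] the image to find
            None if no image exists in the xml_code

    """
    # Parse the xml code for the blip
    # (needs `from xml.dom import minidom` and `from io import BytesIO` at module level)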
    xml_parser = minidom.parseString(xml_code)

    items = xml_parser.getElementsByTagName('a:blip')

    # Check if an image exists
    if items:
        # Extract the rId of the image
        rId = items[0].attributes['r:embed'].value

        # Get the blob of the image (via the public `blob` property)
        source_document_part = source_document.part
        image_part = source_document_part.related_parts[rId]
        image_bytes = image_part.blob

        # Wrap the image bytes in a BytesIO stream that can be fed to add_picture()
        image_stream = BytesIO(image_bytes)

        return image_stream
    # If no image exists
    else:
        return None
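
To make it clearer what get_image_from_xml looks for, here is a small standalone sketch of the rId lookup on a hand-written, simplified cell fragment (real cells have more wrapper elements around a:blip). It swaps the prefix-based lookup above for a namespace-aware one, so it does not depend on the document using the prefixes 'a' and 'r'. The names find_embed_ids and SAMPLE_CELL_XML are made up for this illustration.

```python
from xml.dom import minidom

# Namespace URIs used by DrawingML and by relationship attributes
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def find_embed_ids(xml_code):
    """Return the r:embed value of every a:blip element in *xml_code*."""
    dom = minidom.parseString(xml_code)
    ids = []
    for blip in dom.getElementsByTagNameNS(A_NS, "blip"):
        r_id = blip.getAttributeNS(R_NS, "embed")
        if r_id:  # skip blips that have no embedded-image relationship
            ids.append(r_id)
    return ids


# Simplified, hand-written stand-in for the output of cell._tc.xml
SAMPLE_CELL_XML = f"""<w:tc xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
      xmlns:a="{A_NS}" xmlns:r="{R_NS}">
  <w:p><w:r><w:drawing><a:graphic><a:graphicData>
    <a:blip r:embed="rId7"/>
  </a:graphicData></a:graphic></w:drawing></w:r></w:p>
</w:tc>"""

print(find_embed_ids(SAMPLE_CELL_XML))
```

Each returned id can then be used as the key into related_parts, as in the method above; returning a list also covers cells that hold more than one picture.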

To call the method, I used:

# Replace the employer and client logos
self.replace_logos_from_source(self.source_document, self.template_doc, 0, 2)  # Employer logo
self.replace_logos_from_source(self.source_document, self.template_doc, 1, 2)  # Client logo
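
One caveat about the size note above: when add_picture is called without a width or height, the picture is inserted at its native size, which may differ from the size shown in the source document if the logo was scaled there. A possible way to recover the displayed size, assuming the picture's XML contains a wp:extent element (cx and cy in EMU, 914400 per inch), is sketched below. The names get_extent_inches and SAMPLE_DRAWING_XML are hypothetical and the fragment is hand-written.

```python
from xml.dom import minidom

WP_NS = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
EMU_PER_INCH = 914400


def get_extent_inches(xml_code):
    """Return (width, height) in inches of the first wp:extent, or None."""
    dom = minidom.parseString(xml_code)
    extents = dom.getElementsByTagNameNS(WP_NS, "extent")
    if not extents:
        return None
    # cx and cy are plain (un-namespaced) attributes holding EMU values
    cx = int(extents[0].getAttribute("cx"))
    cy = int(extents[0].getAttribute("cy"))
    return cx / EMU_PER_INCH, cy / EMU_PER_INCH


# Simplified, hand-written stand-in for the drawing XML inside a cell
SAMPLE_DRAWING_XML = (
    '<w:drawing xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" '
    f'xmlns:wp="{WP_NS}">'
    '<wp:inline><wp:extent cx="914400" cy="457200"/></wp:inline>'
    '</w:drawing>'
)

print(get_extent_inches(SAMPLE_DRAWING_XML))
```

The resulting values could then be passed to add_picture through its width and height arguments (for example wrapped in docx.shared.Inches), if keeping the displayed size matters.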