How to extract an image from a table in an MS Word document with the docx library?

Question

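I am working on a program that needs to extract two images from an MS Word document so that they can be used in another document. I know where the images are located (the first table in the document), but when I try to extract any information from the table (even just plain text), I get empty cells.

Here is the Word document from which I want to extract the images. I want to extract the 'Rentel' images from the first page (first table, rows 0 and 1, column 2).

I tried the following code: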

from docxtpl import DocxTemplate

source_document = DocxTemplate("Source document.docx")

# It doesn't really matter which rows or columns I use for the cells, everything is empty
print(source_document.tables[0].cell(0,0).text)

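This just gives me empty lines...

I read in other questions that the problem might be that the table is "contained in a wrapper element that Python Docx cannot read". They suggest altering the source document, but I want to be able to select any document that was previously created from the same template as the source document (so those documents have the same problem, and I cannot change every document separately). A Python-only solution is therefore the only way I can think of to solve the problem.

Since I only want those two specific images, unzipping the Word file and extracting any random image from the xml does not suit my case, unless I know which image name I need to extract from the unzipped Word folder.

I really want this to work because it is part of my thesis (and I am just an electromechanical engineer, so I don't know that much about software).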

[EDIT]: Here is the xml code for the first image (source_document.tables[0].cell(0,2)._tc.xml) and here it is for the second image (source_document.tables[0].cell(1,2)._tc.xml). I noticed, however, that taking (0,2) as the row and column value gives me all the rows in column 2 within the first "visible" table. Cell (1,2) gives me all the rows in column 2 within the second "visible" table.

If the problem cannot be solved directly with Python Docx, is it possible to search the XML code for the image name or ID or something similar, and then add the image with Python Docx using this ID/name?

Answer 1

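Well, the first thing that jumps out is that both of the cells (w:tc elements) you posted contain a nested table. This is perhaps unusual, but it is certainly a valid composition. Maybe they did that so they could include a caption in a cell below the image, or something like that.

To access the nested table, you will have to do something like this: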

outer_cell = source_document.tables[0].cell(0,2)
nested_table = outer_cell.tables[0]
inner_cell_1 = nested_table.cell(0, 0)
print(inner_cell_1.text)
# ---etc....---

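I'm not sure that solves your whole problem, but it strikes me that this is in the end two or more questions. The first is "Why isn't my table cell showing up?" and the second is perhaps "How do I get an image out of a table cell?" (once you have actually found the cell in question).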
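
A minimal sketch of how the nested tables could be located when their position is not known in advance. It relies only on the python-docx attributes `rows`, `cells` and `tables` used above; the helper name `iter_cells` is hypothetical and not part of python-docx.

```python
def iter_cells(table, depth=0):
    """Yield (depth, row_index, column_index, cell) for every cell of *table*
    and, recursively, for every cell of the tables nested inside those cells.

    Note: for merged cells, python-docx may return the same cell more than
    once in row.cells, so the same nested table can be visited repeatedly.
    """
    for row_index, row in enumerate(table.rows):
        for column_index, cell in enumerate(row.cells):
            yield depth, row_index, column_index, cell
            # A cell is itself a container and can hold further tables
            for nested_table in cell.tables:
                yield from iter_cells(nested_table, depth + 1)
```

Looping over `iter_cells(source_document.tables[0])` and printing `depth`, the indices and `cell.text` should show at which nesting level the text (and therefore the image cell) actually lives.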

Answer 2

For anyone with the same problem, this is the code that helped me solve it.

First, I extract the nested cell from the table using the following method:

@staticmethod
def get_nested_cell(table, outer_row, outer_column, inner_row, inner_column):
    """
        Returns the nested cell (cell of a table inside a table) of the *table*

        :argument
            table: [docx.Table] outer table from which to get the nested table
            outer_row: [int] row of the outer table in which the nested table is
            outer_column: [int] column of the outer table in which the nested table is
            inner_row: [int] row in the nested table from which to get the nested cell
            inner_column: [int] column in the nested table from which to get the nested cell
        :return
            inner_cell: [docx.Cell] nested cell
    """
    # Get the outer cell, then the first table nested inside it
    outer_cell = table.cell(outer_row, outer_column)
    nested_table = outer_cell.tables[0]
    inner_cell = nested_table.cell(inner_row, inner_column)

    return inner_cell

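With this cell, I can get the xml code and extract the image from that xml code. Notes:

I did not set the width and height of the picture, because I wanted it to keep the same size.
In the replace_logos_from_source method, I know that the table I want to get the logos from is 'tables[0]' and that the nested table is at outer_row and outer_column '0', so I simply filled those values in when calling get_nested_cell instead of adding extra arguments to replace_logos_from_source.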

def replace_logos_from_source(self, source_document, target_document, inner_row, inner_column):
    """
        Replace the employer and client logo from the *source_document* to the *target_document*. Since the table
        in which the logos are placed are nested tables, the source and target cells with *inner_row* and
        *inner_column* are first extracted from the nested table.

        :argument
            source_document: [DocxTemplate] document from which to extract the image
            target_document: [DocxTemplate] document to which to add the extracted image
            inner_row: [int] row in the nested table from which to get the image
            inner_column: [int] column in the nested table from which to get the image
        :return
            Nothing
    """
    # Get the target and source cell (I know that the table where I want to get the logos from is 'tables[0]' and that the nested table is in outer_row and outer_column '0', so I just filled it in without adding extra arguments to the method)
    target_cell = self.get_nested_cell(target_document.tables[0], 0, 0, inner_row, inner_column)
    source_cell = self.get_nested_cell(source_document.tables[0], 0, 0, inner_row, inner_column)

    # Get the xml code of the inner cell
    inner_cell_xml = source_cell._tc.xml

    # Get the image from the xml code
    image_stream = self.get_image_from_xml(source_document, inner_cell_xml)

    # Add the image to the target cell
    paragraph = target_cell.paragraphs[0]
    if image_stream:  # If not None (image exists)
        run = paragraph.add_run()
        run.add_picture(image_stream)
    else:
        # Set the target cell text equal to the source cell text
        paragraph.add_run(source_cell.text)

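# Requires at module level: from io import BytesIO; from xml.dom import minidom
@staticmethod
def get_image_from_xml(source_document, xml_code):
    """
        Returns the first image referenced in the *xml_code* as a stream

        :argument
            source_document: [DocxTemplate] document that contains the image part
            xml_code: [string] xml code from which to extract the image
        :return
            image_stream: [BytesIO stream] the image to find
            None if no image exists in the xml_code

    """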
    # Parse the xml code for the blip
    xml_parser = minidom.parseString(xml_code)

    items = xml_parser.getElementsByTagName('a:blip')

    # Check if an image exists
    if items:
        # Extract the rId of the image
        rId = items[0].attributes['r:embed'].value

        # Get the bytes of the image part that the rId points to
        source_document_part = source_document.part
        image_part = source_document_part.related_parts[rId]
        image_bytes = image_part.blob

        # Wrap the image bytes in a BytesIO stream so they can be passed to add_picture()
        image_stream = BytesIO(image_bytes)

        return image_stream
    # If no image exists
    else:
        return None
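
The lookup `getElementsByTagName('a:blip')` matches the literal prefix `a:`, which is what Word normally writes but is only a convention. As a hedged alternative, the sketch below finds the same `r:embed` ids by namespace URI with the standard library's ElementTree. The function name `find_image_rids` is hypothetical; the two namespace URIs are the standard DrawingML and relationships namespaces.

```python
import xml.etree.ElementTree as ET

# Namespace URIs behind the usual "a:" and "r:" prefixes
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def find_image_rids(xml_code):
    """Return the r:embed ids of all a:blip elements in *xml_code*, in document order."""
    root = ET.fromstring(xml_code)
    rids = []
    for blip in root.iter(f"{{{A_NS}}}blip"):
        rid = blip.get(f"{{{R_NS}}}embed")
        if rid is not None:  # linked (not embedded) pictures have no r:embed
            rids.append(rid)
    return rids
```

Unlike the method above, this returns every embedded image id in the cell rather than only the first, so a cell holding several pictures can be handled by indexing into the list.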

To call the method, I used:

# Replace the employer and client logos
self.replace_logos_from_source(self.source_document, self.template_doc, 0, 2)  # Employer logo
self.replace_logos_from_source(self.source_document, self.template_doc, 1, 2)  # Client logo
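
The question also asks whether the image file name inside the unzipped Word file can be found from an ID. A rough sketch with only the standard library: once an rId is known, the relationships file of the main document part maps it to a file in the package. This assumes the main part is stored as word/document.xml, which is usual but not guaranteed; the function name `image_targets` is hypothetical.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# Assumption: the main document part is word/document.xml
RELS_PATH = "word/_rels/document.xml.rels"
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def image_targets(docx_file):
    """Map relationship id -> path inside the .docx for embedded images."""
    with zipfile.ZipFile(docx_file) as package:
        root = ET.fromstring(package.read(RELS_PATH))

    targets = {}
    for rel in root.iter(f"{{{PKG_REL_NS}}}Relationship"):
        is_image = rel.get("Type", "").endswith("/image")
        is_external = rel.get("TargetMode") == "External"
        if is_image and not is_external:
            # Targets are normally relative to the word/ folder
            path = posixpath.normpath(posixpath.join("word", rel.get("Target")))
            targets[rel.get("Id")] = path.lstrip("/")
    return targets
```

With the mapping in hand, `zipfile.ZipFile(path).read(targets[rId])` would return the raw bytes of that specific image, so only the two wanted logos need to be read.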

python

xml

ms-word

docx

python-docx