How to extract an image from a table in an MS Word document with the docx library?
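I am working on a program that needs to extract two images from an MS Word document so they can be used in another document. I know where the images are (the first table in the document), but when I try to extract any information from the table (even just plain text), I get empty cells.
Here is the Word document I want to extract the images from. I want to extract the 'Rentel' images from the first page (first table, rows 0 and 1, column 2).
I tried the following code: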
from docxtpl import DocxTemplate
source_document = DocxTemplate("Source document.docx")
# It doesn't really matter which rows or columns I use for the cells, everything is empty
print(source_document.tables[0].cell(0,0).text)
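This just gives me empty lines...
I have read in other posts that the cause might be content "contained in a wrapper element that Python Docx cannot read". They suggest changing the source document, but I want to be able to select any document that was previously created from the same template as the source document (so those documents have the same problem, and I cannot change each one individually). A Python-only solution is therefore really the only way I can think of to solve this.
Since I also only want those two specific images, extracting any random image from the XML by unzipping the Word file does not suit my case, unless I know which image name I need to pull from the unzipped Word folder.
I really want this to work, as it is part of my thesis (and I am just an electromechanical engineer, so I do not know much about software).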
[EDIT]: Here is the XML code for the first image (source_document.tables[0].cell(0,2)._tc.xml) and here it is for the second image (source_document.tables[0].cell(1,2)._tc.xml). I noticed, however, that taking (0,2) as the row and column value gives me all the rows in column 2 within the first "visible" table. Cell (1,2) gives me all the rows in column 2 within the second "visible" table.
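If the problem cannot be solved directly with Python Docx, is it possible to search the XML code for the image name or ID or something similar, and then add the image with Python Docx using that ID/name?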
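Well, the first thing that jumps out is that the two cells (w:tc elements) you posted each contain a nested table. This is perhaps unusual, but it is certainly a valid composition. Maybe they did this so they could add a caption in a cell below the image, or something like that.
To access the nested table, you will have to do something like this: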
outer_cell = source_document.tables[0].cell(0,2)
nested_table = outer_cell.tables[0]
inner_cell_1 = nested_table.cell(0, 0)
print(inner_cell_1.text)
# ---etc....---
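I am not sure this solves your whole problem, but it strikes me that this is ultimately two or more questions. The first is "Why isn't my table cell showing up?" and the second might be "How do I get an image out of a table cell?" (once you have actually found the cell in question).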
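To illustrate why an outer cell can look empty, here is a minimal sketch that uses only the standard library on a hand-written WordprocessingML fragment (not the asker's document). The helper names direct_text and nested_tables are hypothetical. It assumes, as I believe is the case for python-docx's cell.text, that only paragraphs sitting directly in the cell are read, so text inside a nested w:tbl is not picked up from the outer cell.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def direct_text(cell):
    """Join the text of paragraphs that are direct children of the cell."""
    return "".join(
        t.text or ""
        for p in cell.findall(f"{{{W}}}p")
        for t in p.iter(f"{{{W}}}t")
    )


def nested_tables(cell):
    """Return the w:tbl elements that sit directly inside the cell."""
    return cell.findall(f"{{{W}}}tbl")


# Hand-written stand-in for an outer w:tc that wraps a nested table
sample = f"""<w:tc xmlns:w="{W}">
<w:tbl><w:tr><w:tc><w:p><w:r><w:t>Rentel</w:t></w:r></w:p></w:tc></w:tr></w:tbl>
<w:p/>
</w:tc>"""

outer = ET.fromstring(sample)
print(repr(direct_text(outer)))   # the outer cell has no text of its own

inner = nested_tables(outer)[0].find(f"{{{W}}}tr/{{{W}}}tc")
print(direct_text(inner))         # the text lives in the nested cell
```

The same idea is what outer_cell.tables[0] does in the snippet above: step into the nested table first, then read its cells.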
For anyone with the same problem, this is the code that helped me solve it:
First, I extract the nested cell from the table with the following method:
@staticmethod
def get_nested_cell(table, outer_row, outer_column, inner_row, inner_column):
    """
    Returns the nested cell (cell of a table inside a table) of the *table*
    :argument
        table: [docx.Table] outer table from which to get the nested table
        outer_row: [int] row of the outer table in which the nested table is
        outer_column: [int] column of the outer table in which the nested table is
        inner_row: [int] row in the nested table from which to get the nested cell
        inner_column: [int] column in the nested table from which to get the nested cell
    :return
        inner_cell: [docx.Cell] nested cell
    """
    # Get the outer cell that holds the nested table
    outer_cell = table.cell(outer_row, outer_column)
    # The first table inside that cell is the nested table
    nested_table = outer_cell.tables[0]
    inner_cell = nested_table.cell(inner_row, inner_column)
    return inner_cell
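With this cell, I can get the XML code and extract the image from that XML code. Notes:
- I did not set the width and height of the image because I want them to stay the same.
- In the replace_logos_from_source method, I know that the table I want to get the logos from is 'tables[0]' and that the nested table is at outer_row and outer_column '0', so I just filled those values in when calling get_nested_cell, without adding extra arguments to replace_logos_from_source.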
# Module-level imports used by the methods below
from io import BytesIO
from xml.dom import minidom

def replace_logos_from_source(self, source_document, target_document, inner_row, inner_column):
    """
    Replace the employer and client logo from the *source_document* to the *target_document*. Since the tables
    in which the logos are placed are nested tables, the source and target cells with *inner_row* and
    *inner_column* are first extracted from the nested table.
    :argument
        source_document: [DocxTemplate] document from which to extract the image
        target_document: [DocxTemplate] document to which to add the extracted image
        inner_row: [int] row in the nested table from which to get the image
        inner_column: [int] column in the nested table from which to get the image
    :return
        Nothing
    """
    # Get the target and source cell. The logos are in 'tables[0]' and the nested table is at
    # outer_row and outer_column '0', so those values are hard-coded here.
    target_cell = self.get_nested_cell(target_document.tables[0], 0, 0, inner_row, inner_column)
    source_cell = self.get_nested_cell(source_document.tables[0], 0, 0, inner_row, inner_column)
    # Get the xml code of the inner cell
    inner_cell_xml = source_cell._tc.xml
    # Get the image from the xml code
    image_stream = self.get_image_from_xml(source_document, inner_cell_xml)
    # Add the image to the target cell
    paragraph = target_cell.paragraphs[0]
    if image_stream is not None:  # An image exists in the source cell
        run = paragraph.add_run()
        run.add_picture(image_stream)
    else:
        # Set the target cell text equal to the source cell text
        paragraph.add_run(source_cell.text)

@staticmethod
def get_image_from_xml(source_document, xml_code):
    """
    Returns the first image referenced in the *xml_code* as a stream
    :argument
        source_document: [DocxTemplate] document that owns the image part
        xml_code: [string] xml code from which to extract the image
    :return
        image_stream: [BytesIO stream] the image to find
        None if no image exists in the xml_code
    """
    # Parse the xml code for the blip
    xml_parser = minidom.parseString(xml_code)
    items = xml_parser.getElementsByTagName('a:blip')
    # If no image exists
    if not items:
        return None
    # Extract the rId of the image
    rId = items[0].attributes['r:embed'].value
    # Look up the image part through the document part's relationships
    image_part = source_document.part.related_parts[rId]
    # Wrap the image bytes in a stream that add_picture() accepts
    return BytesIO(image_part.blob)
To call the method, I used:
# Replace the employer and client logos
self.replace_logos_from_source(self.source_document, self.template_doc, 0, 2) # Employer logo
self.replace_logos_from_source(self.source_document, self.template_doc, 1, 2) # Client logo
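As a possible variation on the lookup step in get_image_from_xml, here is a sketch that finds the relationship IDs by namespace URI instead of by the literal 'a:blip' and 'r:embed' prefixes, since minidom's getElementsByTagName matches the prefixed name as written. The function name find_image_rids and the sample fragment are hypothetical; the fragment only stands in for source_cell._tc.xml and is not taken from the asker's document.

```python
import xml.etree.ElementTree as ET

# Namespace URIs that the a: and r: prefixes normally map to in a .docx
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def find_image_rids(xml_code):
    """Return the r:embed relationship ids of every a:blip in xml_code."""
    root = ET.fromstring(xml_code)
    rids = []
    for blip in root.iter(f"{{{A_NS}}}blip"):
        rid = blip.get(f"{{{R_NS}}}embed")
        if rid is not None:
            rids.append(rid)
    return rids


# Hand-written stand-in for the XML of a cell that holds one picture
sample = (
    f'<w:tc xmlns:w="{W_NS}" xmlns:a="{A_NS}" xmlns:r="{R_NS}">'
    '<w:p><w:r><w:drawing><a:blip r:embed="rId7"/></w:drawing></w:r></w:p>'
    '</w:tc>'
)
print(find_image_rids(sample))
```

Each returned ID could then be passed to source_document.part.related_parts, as in the method above. Returning a list also makes it easy to notice when a cell holds more than one picture.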