Remove all images from docx files

Question

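I have searched the python-docx documentation, other packages, and Stack Overflow, but I cannot find how to remove all images from a docx file with Python.

My exact use case: I need to convert hundreds of Word documents to a "draft" format for clients to review. These drafts should be identical to the originals, except that all images must be removed or redacted.

Sorry for not including examples of what I have tried; what I have tried is hours of research that turned up nothing. I did find this question about extracting images from Word files, but it does not remove them from the actual document: Extract pictures from Word and Excel with Python

From there and other sources I learned that a docx file can be read as a plain zip archive. I do not know whether that means it can be "re-zipped" without the images without affecting the integrity of the docx file (edit: simply deleting the images works, but it prevents python-docx from working with the file afterwards, because the references to the images are left dangling), but I think this might be a path to a solution.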
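For reference, the "re-zip without the images" idea can be sketched with the standard library alone. This is a minimal sketch, assuming the pictures are stored under word/media/ inside the archive (the usual location in Word-generated files); the function name copy_without_media is hypothetical. As noted in the edit above, the relationships that point at the removed parts are left dangling, so python-docx may refuse to open the result.

```python
import zipfile

def copy_without_media(src, dst):
    """Copy a .docx archive to dst, skipping every part under word/media/.

    src and dst may be paths or file-like objects.
    Returns the list of archive members that were skipped.
    """
    skipped = []
    with zipfile.ZipFile(src) as inzip, zipfile.ZipFile(dst, "w") as outzip:
        for info in inzip.infolist():
            # Assumption: embedded pictures live under word/media/
            if info.filename.startswith("word/media/"):
                skipped.append(info.filename)
                continue
            # Reusing the ZipInfo keeps the original name and compression type
            outzip.writestr(info, inzip.read(info))
    return skipped
```

Calling copy_without_media("report.docx", "report_draft.docx") would write the stripped copy and return the names of the dropped image parts; the document XML itself is left untouched.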

Any ideas?

Answer 1

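I don't know this library, but looking through the documentation I found this section about images. It mentions that it is currently not possible to insert images other than inline ones. If that is what you currently have in your documents, I suppose you could also retrieve them by looking through the Document object and then delete them?

The Document object is described here.

Although it is not a duplicate, you may also want to look at this question, in which user "scanny" explains how he finds images using the library.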

Answer 2

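I don't think this is currently implemented in python-docx.

Pictures in the Word Object Model are defined as either floating shapes or inline shapes. The docx documentation states that it only supports inline shapes.

The Word Object Model for Inline Shapes supports a Delete() method, which should be accessible. However, it is not listed in the examples for InlineShapes, and there is a similar method for paragraphs. For paragraphs, there is an open feature request to add this functionality, and it dates back to 2014! If it has not been added for paragraphs, it will not be available for InlineShapes, since they are implemented as discrete paragraphs.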

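If your computer has both Word and Python installed, you can do this with win32com. It lets you call the Word Object Model directly, which gives you access to the Delete() method. In fact, you can probably cheat: rather than walking through the document to get each image, you can call Find and Replace to clear the images. This SO question talks about find and replace with win32com: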

import win32com.client
from os import getcwd, listdir
from os.path import join

# All Word files in the current directory
docs = [f for f in listdir('.') if f.endswith(('.doc', '.docx'))]

FromTo = {"First Name": "John",
          "Last Name": "Smith"}  # You can insert as many as you want

word = win32com.client.DispatchEx("Word.Application")
word.Visible = True  # Comment this out after testing
word.DisplayAlerts = False
for doc in docs:
    # Word needs an absolute path; join() avoids the stray backslash escape
    word.Documents.Open(join(getcwd(), doc))
    for From in FromTo.keys():
        word.Selection.Find.Text = From
        word.Selection.Find.Replacement.Text = FromTo[From]
        # Replace=2 (wdReplaceAll) replaces every match
        word.Selection.Find.Execute(Replace=2, Forward=True)
    name, ext = doc.rsplit('.', 1)
    word.ActiveDocument.SaveAs(join(getcwd(), '{}_2.{}'.format(name, ext)))

word.Quit()  # releases Word object from memory

In this case, since we are after images, we need to use the special code ^g as find.Text and an empty string as the replacement.

find = word.Selection.Find
find.Text = "^g"  # ^g is Word's find code for a graphic
find.Replacement.Text = ""
find.Execute(Replace=2, Forward=True)  # 2 = wdReplaceAll; 1 would replace only one match

Answer 3

If your goal is to redact the images, this code, which I used for a similar use case, may be useful:

import io
import zipfile
from PIL import Image, ImageFilter

blur = ImageFilter.GaussianBlur(40)

def redact_images(filename):
    outfile = filename.replace(".docx", "_redacted.docx")
    with zipfile.ZipFile(filename) as inzip:
        with zipfile.ZipFile(outfile, "w") as outzip:
            for info in inzip.infolist():
                name = info.filename
                content = inzip.read(info)
                # Note: files ending in ".jpg" are not matched by this tuple
                if name.endswith((".png", ".jpeg", ".gif")):
                    # Blur the image and re-encode it in its original format
                    fmt = name.split(".")[-1]
                    img = Image.open(io.BytesIO(content))
                    img = img.convert().filter(blur)
                    outb = io.BytesIO()
                    img.save(outb, fmt)
                    content = outb.getvalue()
                # writestr() recomputes the size and CRC for the new content
                outzip.writestr(info, content)

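Here I use PIL to blur the images in certain files, but any other suitable operation can be used in place of the blur filter. This worked very well for my use case.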
