python-docx: get word positions and attributes

Question

I am looking for a way to extract the position (x, y) and attributes (font/size) of every word in a document.

From the python-docx documentation, I know that:

Conceptually, Word documents have two layers, a text layer and a drawing layer. In the text layer, text objects are flowed from left to right and from top to bottom, starting a new page when the prior one is filled. In the drawing layer, drawing objects, called shapes, are placed at arbitrary positions. These are sometimes referred to as floating shapes.

A picture is a shape that can appear in either the text or drawing layer. When it appears in the text layer it is called an inline shape, or more specifically, an inline picture.

[...] At the time of writing, python-docx only supports inline pictures.

However, even though this is not what the library is mainly for, I wonder whether something like the following exists:

from docx import Document
main_file = Document("/tmp/file.docx")
for paragraph in main_file.paragraphs:
    for word in paragraph.text:  # <= Non-existent (yet wished-for) functionality, IMHO
        print(word.x, word.y)  # <= Non-existent (yet wished-for) functionality, IMHO

Does anyone have an idea? Best, Arthur

Answer 1

for word in paragraph.text:  # <= Non-existent (yet wished-for) functionality, IMHO

This functionality is available directly in Python's standard library as str.split(). The two compose easily:

for word in paragraph.text.split():
    ...
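As for the font/size half of the question: in python-docx those attributes live on runs, not on words, so one rough way to get them per word is to split each run's text. This is only a sketch under assumptions. `iter_words_with_font` and `print_words` are names made up for illustration, a word that spans two runs comes out as two fragments, and `None` means the value is inherited from a style rather than set directly on the run.

```python
def iter_words_with_font(paragraph):
    """Yield (word, font_name, font_size_in_points) for each word in each run.

    Expects a python-docx Paragraph (anything with .runs whose items expose
    .text, .font.name and .font.size works too).
    """
    for run in paragraph.runs:
        size = run.font.size  # a Length, or None when inherited from a style
        size_pt = size.pt if size is not None else None
        for word in run.text.split():
            yield word, run.font.name, size_pt


def print_words(path):
    """Hypothetical helper: print every word of a .docx with its run font."""
    from docx import Document  # requires the python-docx package

    for paragraph in Document(path).paragraphs:
        for word, name, size_pt in iter_words_with_font(paragraph):
            print(word, name, size_pt)
```

Calling `print_words("/tmp/file.docx")` would list each word with the font name and size set on its run. It still says nothing about x/y positions.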

Regarding

print(word.x, word.y)  # <= Non-existent (yet wished-for) functionality, IMHO

I think it's safe to say this feature will never appear in python-docx, and even if it did, it would not look like this.

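What such a feature would do is ask a page renderer where it would place those characters. python-docx has no rendering engine (because it does not render documents); it is just a fancy XML editor that selectively modifies XML files in the WordprocessingML vocabulary.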
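To make the "fancy XML editor" point concrete, here is a minimal sketch using only the standard library. It builds a stand-in archive containing just `word/document.xml` (not a complete, valid Word file) and reads the run properties back. The sample XML and the names `make_sample` and `read_runs` are made up for illustration. What the XML stores is text plus formatting such as `w:rFonts` and `w:sz` (in half-points); there are no coordinates to read.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Illustrative WordprocessingML: one paragraph, one run, 12 pt Arial.
DOCUMENT_XML = f"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:r>
        <w:rPr><w:rFonts w:ascii="Arial"/><w:sz w:val="24"/></w:rPr>
        <w:t>Hello world</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:document>"""


def make_sample():
    """Return an in-memory zip holding only word/document.xml."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", DOCUMENT_XML)
    buf.seek(0)
    return buf


def read_runs(docx_file):
    """Return text, font name and size (in points) for every run in the body."""
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    runs = []
    for r in root.iterfind(".//w:r", NS):
        fonts = r.find("w:rPr/w:rFonts", NS)
        sz = r.find("w:rPr/w:sz", NS)
        runs.append({
            "text": "".join(t.text or "" for t in r.iterfind("w:t", NS)),
            "font": fonts.get(f"{{{W}}}ascii") if fonts is not None else None,
            # w:sz is stored in half-points
            "size_pt": int(sz.get(f"{{{W}}}val")) / 2 if sz is not None else None,
        })
    return runs


print(read_runs(make_sample()))
```

Where each character ends up on the page is decided only when a renderer lays the text out, which is why that information cannot be read from the file.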

It may be possible to get these values from Word itself, since Word does have a rendering engine (for screen display and printing).

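If there were such a function, I would expect it to take a paragraph and a character offset within that paragraph, or something along those lines, such as document.position(paragraph, offset=42) or paragraph.position(offset=42).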
