Extracting MS Word document formatting elements along with raw text information

Question

In this post, @mikemaccana describes how to use python-docx to extract raw text data from an MS Word document from within Python. I'd like to go one step further. Instead of simply extracting the raw text, can I also use this module to harvest information about font face (e.g. bold versus italic) or font size (e.g. 12 pt versus 18 pt)? The closest I was able to come was this post, which asks about using this module to extract highlighted text entries.

It looks a bit abstract, and I'm not quite sure what is going on there. Is there a more direct way to extract formatting information from a Word document in Python? As a quick document template:

Here the first line is a large header with one sentence.

The second line is slightly smaller. It also has two sentences.

Even smaller. But that's not all. This line has three sentences.

And finally here's a regular line of unbolded text.

If we call these four lines my Word document, I'd like to write a parsing function, call it doc_parser, that returns something like the following:

>>> doc_data = doc_parser(path_to_example_doc)
>>> print(doc_data)
[{'font': 18, 'face': 'bold', 'n_sentence': 1},
 {'font': 16, 'face': 'bold', 'n_sentence': 2},
 {'font': 14, 'face': 'bold', 'n_sentence': 3},
 {'font': 12, 'face': 'plain', 'n_sentence': 1}]

Answer 1

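Character-level formatting ("font") properties are available at the run level. A paragraph is made up of runs. So you can get what you want by going down to that level, like: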

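from docx import Document

document = Document(path_to_example_doc)  # path to the .docx file
for paragraph in document.paragraphs:
    for run in paragraph.runs:
        font = run.font
        is_bold = font.bold  # True, False, or None (not set directly on the run)
        # etc. (font.italic, font.size, ...)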

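The biggest problem you're likely to run into is that a run only knows about formatting that has been applied to it directly. If it looks the way it does because a style was applied, then you'll have to query the style (which also has a font object) to see what properties it has.

Note that the python-docx Mike is talking about is the legacy version, which was completely rewritten after v0.2.0 (now 0.8.6). The docs are here: http://python-docx.readthedocs.org/en/latest/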
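As a rough sketch of that style lookup, something like the following could produce the doc_parser output from the question. The helper names effective and summarize are made up for illustration, and several simplifications are assumed: each paragraph is described by its first non-empty run, sentences are counted with a naive regex, and the lookup only walks run, character style, paragraph style and their base styles. Document defaults, theme fonts and Word's toggle semantics for bold/italic are not handled, so treat the result as an approximation rather than what Word actually renders.

```python
import re


def effective(run, paragraph, attr):
    """Return a font attribute, falling back from the run to its styles.

    python-docx reports None for a property that is not set at a given
    level, so check the run first, then the character style chain, then
    the paragraph style chain (each followed through base_style).
    """
    value = getattr(run.font, attr)
    if value is not None:
        return value
    for style in (run.style, paragraph.style):
        while style is not None:
            value = getattr(style.font, attr)
            if value is not None:
                return value
            style = style.base_style
    return None  # not set anywhere we looked


def summarize(paragraph):
    """Describe one paragraph using its first non-empty run (an assumption)."""
    runs = [r for r in paragraph.runs if r.text.strip()]
    if not runs:
        return None
    run = runs[0]
    size = effective(run, paragraph, "size")
    if effective(run, paragraph, "bold"):
        face = "bold"
    elif effective(run, paragraph, "italic"):
        face = "italic"
    else:
        face = "plain"
    # Naive sentence count: runs of ., ! or ? followed by whitespace or end of text.
    n_sentence = len(re.findall(r"[.!?]+(?:\s|$)", paragraph.text))
    return {
        "font": size.pt if size is not None else None,  # size in points
        "face": face,
        "n_sentence": n_sentence,
    }


def doc_parser(path):
    """Return one summary dict per non-empty paragraph of the .docx at path."""
    from docx import Document  # python-docx; imported here so the helpers stand alone

    rows = (summarize(p) for p in Document(path).paragraphs)
    return [row for row in rows if row is not None]
```

Calling doc_parser(path_to_example_doc) returns a list of dicts shaped like the one in the question. The 'font' value is None when no size is set on the run or on any style in the chain, which is typically the case when the size comes from the document defaults.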
