Parsing docx files in Python

Question

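I am trying to read headings from multiple docx files. Annoyingly, these headings have no identifiable paragraph style: every paragraph uses the "Normal" paragraph style, so I am using a regular expression instead. The headings are in bold and are structured as follows: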

A. Cat

B. Dog

C. Pig

D. Fox

If a file contains more than 26 headings, the later headings begin with "AA.", "BB.", and so on.

I have the following code, which mostly works, except that any heading beginning with "D." is printed twice, e.g. [Cat, Dog, Pig, Fox, Fox].

import os
from docx import Document
import re

directory = input("Copy and paste the location of the files.\n").lower()

for file in os.listdir(directory):

    document = Document(directory+file)

    head1s = []

    for paragraph in document.paragraphs:

        heading = re.match(r'^[A-Z]+[.]\s', paragraph.text)

        for run in paragraph.runs:

            if run.bold:

                if heading:
                    head1 = paragraph.text
                    head1 = head1.split('.')[1]
                    head1s.append(head1)

    print(head1s)

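Can anyone tell me whether something in the code is causing this? As far as I can tell, there is nothing unique about the formatting or structure of these particular headings in the Word files.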

Answer 1

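What is happening is that the inner loop keeps going after D. Fox has been added. On the next pass nothing new is matched, but heading is still set for that paragraph, so the last value of head1, i.e. D. Fox, is appended again.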

I think for run in paragraph.runs: is somehow iterating twice for that paragraph. Maybe there is a second run in there that is not visible in Word?
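To illustrate that suspicion, here is a minimal sketch that replays the question's loop on stand-in objects rather than on a real file. The fake_paragraph helper and collect_like_question function are hypothetical names made up for this example; they only mimic the text, runs and bold attributes the loop touches. It assumes the "D. Fox" paragraph is stored as two bold runs, which Word commonly does even when the text looks uniform.

```python
import re
from types import SimpleNamespace

HEADING_RE = re.compile(r'^[A-Z]+[.]\s')


def collect_like_question(paragraphs):
    """Replays the question's loop: one append per bold run."""
    found = []
    for paragraph in paragraphs:
        heading = HEADING_RE.match(paragraph.text)
        for run in paragraph.runs:
            if run.bold and heading:
                found.append(paragraph.text.split('.')[1])
    return found


def fake_paragraph(*run_texts):
    """Hypothetical stand-in for a paragraph made of bold runs."""
    runs = [SimpleNamespace(text=t, bold=True) for t in run_texts]
    return SimpleNamespace(text=''.join(run_texts), runs=runs)


paragraphs = [
    fake_paragraph('A. Cat'),      # stored as a single run
    fake_paragraph('D. ', 'Fox'),  # assumed to be split into two runs
]
duplicated = collect_like_question(paragraphs)
print(duplicated)  # [' Cat', ' Fox', ' Fox']
```

On a real document, printing [run.text for run in paragraph.runs] for the affected paragraph should show whether it really is split this way.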

Maybe adding a break as soon as the first match is found is enough to stop the second run from triggering?

for file in os.listdir(directory):

    # os.path.join also works when directory has no trailing separator
    document = Document(os.path.join(directory, file))

    head1s = []

    for paragraph in document.paragraphs:

        heading = re.match(r'^[A-Z]+[.]\s', paragraph.text)

        for run in paragraph.runs:

            if run.bold:

                if heading:
                    head1 = paragraph.text
                    head1 = head1.split('.')[1]
                    head1s.append(head1)
                    # this break stops the run loop if a match was found.
                    break

    print(head1s)
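As a possible variation on the break, the check can be made once per paragraph with any(), so the number of runs no longer matters. The sketch below is only one way to do it, not the asker's method: extract_headings and fake_paragraph are hypothetical names, and the stand-in objects merely mimic the text, runs and bold attributes. It also swaps split('.')[1] for a regex capture group, which keeps headings that contain further periods intact and drops the leading space.

```python
import re
from types import SimpleNamespace

# Letters, a period, whitespace, then the heading text (captured).
HEADING_RE = re.compile(r'^[A-Z]+\.\s+(.*)')


def extract_headings(paragraphs):
    """Return each matching heading once if any of its runs is bold."""
    headings = []
    for paragraph in paragraphs:
        match = HEADING_RE.match(paragraph.text)
        if match and any(run.bold for run in paragraph.runs):
            headings.append(match.group(1).strip())
    return headings


def fake_paragraph(*run_texts, bold=True):
    """Hypothetical stand-in for a paragraph object, for illustration only."""
    runs = [SimpleNamespace(text=t, bold=bold) for t in run_texts]
    return SimpleNamespace(text=''.join(run_texts), runs=runs)


sample = [
    fake_paragraph('A. Cat'),
    fake_paragraph('D. ', 'Fox'),               # two bold runs, one heading
    fake_paragraph('AA. Mr. Fox'),              # extra period in the text
    fake_paragraph('B. not bold', bold=False),  # matches the regex, not bold
]
headings = extract_headings(sample)
print(headings)  # ['Cat', 'Fox', 'Mr. Fox']
```

With python-docx, the same function should accept Document(path).paragraphs directly. One caveat: run.bold is None when the bold setting is inherited from a style, so headings that are bold only through a style would be skipped by this check.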

python

regex

python-docx