Parse Docx file content w.r.t. headings

Question

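I want to parse the structure and content of a docx file using python-docx. The document is structured with the styles 'Heading 1' through 'Heading 6'. The content under any heading may take the form of a table element.

I know how to use python-docx to:

extract headings and tables independently of each other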

    doc = Document("file.docx")
    for paragraph in doc.paragraphs:
        if paragraph.style == doc.styles['Heading 1']:
            indent = 1
            result.append('- %s' % paragraph.text.strip())
        elif paragraph.style == doc.styles['Heading 2']:
            indent = 2
            result.append('  ' * indent + '- %s:' % paragraph.text.strip())
        elif paragraph.style == doc.styles['Heading 3']:
            indent = 3
            result.append('  ' * indent + '- %s:' % paragraph.text.strip())
        [...]
        else:
            [...]

    for table in doc.tables:
        if _is_content(table.row_cells(0)[0].text):
            result.add_table(table)

My problem is preserving the structure. How can I find out which heading a table sits under in the source document?

Answer 1

You can walk the underlying XML to extract structured information from the docx file. Try this:

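    from docx import Document
    from docx.oxml.ns import qn
    from docx.text.paragraph import Paragraph

    doc = Document("file.docx")
    headings = []  # fill with the heading texts extracted by your own code
    tables = []    # fill with the tables extracted by your own code
    tags = []      # 'heading', 'text' or 'table', one per body element, in document order
    all_text = []  # text of every non-empty paragraph, in document order

    # Iterate over the direct children of <w:body>, so paragraphs and
    # tables are visited in the order they appear in the document.
    for child in doc.element.body.iterchildren():
        if child.tag == qn('w:tbl'):
            tags.append('table')
        elif child.tag == qn('w:p'):
            # Wrap the raw element so the text of all its runs is collected.
            node_text = Paragraph(child, doc).text
            if node_text:
                if node_text in headings:
                    tags.append('heading')
                else:
                    tags.append('text')
                all_text.append(node_text)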

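After running the code above you will have a list of tags that shows how the document's headings, text, and tables are ordered; you can then map the corresponding data from the other lists onto it.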
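
The mapping step could look roughly like the sketch below. It is only an illustration, not part of python-docx: the function name `group_by_heading` and the dictionary layout are made up here, and it assumes that `all_text` has one entry per 'heading' or 'text' tag and that `tables` has one entry per 'table' tag, both in document order (for example the unfiltered `doc.tables`).

```python
def group_by_heading(tags, all_text, tables):
    """Attach each text and table entry to the most recent heading.

    Hypothetical helper: `tags` holds 'heading', 'text' or 'table' in
    document order, `all_text` the paragraph texts in the same order,
    and `tables` the table objects in document order.
    """
    texts = iter(all_text)
    tbls = iter(tables)
    sections = []
    # Holds anything that appears before the first heading.
    current = {'heading': None, 'text': [], 'tables': []}

    def is_used(section):
        return bool(section['heading'] is not None
                    or section['text'] or section['tables'])

    for tag in tags:
        if tag == 'heading':
            # A new heading closes the section collected so far.
            if is_used(current):
                sections.append(current)
            current = {'heading': next(texts), 'text': [], 'tables': []}
        elif tag == 'text':
            current['text'].append(next(texts))
        elif tag == 'table':
            current['tables'].append(next(tbls))

    if is_used(current):
        sections.append(current)
    return sections
```

Calling `group_by_heading(tags, all_text, doc.tables)` should then give one dictionary per heading, with the tables that follow it listed under 'tables'. If the tables were filtered beforehand, the 'table' tags and the table list no longer line up and the pairing would need to be adjusted.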

python

python-docx