如何使用 python 从 docx 文件的标题下提取文本
How to extract text from under headings in a docx file using python
我想提取 docx 文件中标题下的文本。文本结构有点像这样:
1. DESCRIPTION
Some text here
2. TERMS AND SERVICES
2.1 Some text here
2.2 Some text here
3. PAYMENTS AND FEES
Some text here
我要找的是这样的:
['1. DESCRIPTION','Some text here']
['2. TERMS AND SERVICES','2.1 Some text here 2.2 Some text here']
['3. PAYMENTS AND FEES', 'Some text here']
我试过使用 python-docx 库:
from docx import Document
document = Document('Test.docx')
def iter_headings(paragraphs):
for paragraph in paragraphs:
if paragraph.style.name.startswith('Normal'):
yield paragraph
for heading in iter_headings(document.paragraphs):
print (heading.text)
我的样式在普通、Body 文本和标题 # 之间有所不同。有时标题是普通的,该部分的文本是 Body 文本样式。有人可以指导我正确的方向吗?真的很感激。
你有办法。
提取内容后,只需将具有“Normal”大小写和“BOLD”的部分也标记为标题。但是你必须小心地放置这个逻辑,这样普通段落中出现的粗体字符不会受到影响,即(普通段落中出现的粗体字符只是为了突出该段落中的重要术语)。
您可以通过扫描每个段落,然后遍历该段落的所有运行来检查“该段落中的所有运行是否都是粗体”来执行此操作。因此,如果特定“正常”段落中的所有运行的 属性 都为“粗体”,您可以得出结论,这是一个“标题”。
要应用上述逻辑,您可以在迭代文档的段落时使用以下代码:
#Iterate over paragraphs
for paragraph in document.paragraphs:
#Perform the below logic only for paragraph content which does not have it's native style as "Heading"
if "Heading" not in paragraph.style.name:
#Start of by initializing an empty string to store bold words inside a run
runboldtext = ''
# Iterate over all runs of the current paragraph and collect all the words which are bold into the varible "runboldtext"
for run in paragraph.runs:
if run.bold:
runboldtext = runboldtext + run.text
# Now check if the value of "runboldtext" matches the entire paragraph text. If it matches, it means all the words in the current paragraph are bold and can be considered as a heading
if runboldtext == str(paragraph.text) and runboldtext != '':
print("Heading True for the paragraph: ",runboldtext)
style_of_current_paragraph = 'Heading'
我想提取 docx 文件中标题下的文本。文本结构有点像这样:
1. DESCRIPTION
Some text here
2. TERMS AND SERVICES
2.1 Some text here
2.2 Some text here
3. PAYMENTS AND FEES
Some text here
我要找的是这样的:
['1. DESCRIPTION','Some text here']
['2. TERMS AND SERVICES','2.1 Some text here 2.2 Some text here']
['3. PAYMENTS AND FEES', 'Some text here']
我试过使用 python-docx 库:
from docx import Document
document = Document('Test.docx')
def iter_headings(paragraphs):
for paragraph in paragraphs:
if paragraph.style.name.startswith('Normal'):
yield paragraph
for heading in iter_headings(document.paragraphs):
print (heading.text)
我的样式在普通、Body 文本和标题 # 之间有所不同。有时标题是普通的,该部分的文本是 Body 文本样式。有人可以指导我正确的方向吗?真的很感激。
你有办法。
提取内容后,只需将具有“Normal”大小写和“BOLD”的部分也标记为标题。但是你必须小心地放置这个逻辑,这样普通段落中出现的粗体字符不会受到影响,即(普通段落中出现的粗体字符只是为了突出该段落中的重要术语)。
您可以通过扫描每个段落,然后遍历该段落的所有运行来检查“该段落中的所有运行是否都是粗体”来执行此操作。因此,如果特定“正常”段落中的所有运行的 属性 都为“粗体”,您可以得出结论,这是一个“标题”。
要应用上述逻辑,您可以在迭代文档的段落时使用以下代码:
#Iterate over paragraphs
for paragraph in document.paragraphs:
#Perform the below logic only for paragraph content which does not have it's native style as "Heading"
if "Heading" not in paragraph.style.name:
#Start of by initializing an empty string to store bold words inside a run
runboldtext = ''
# Iterate over all runs of the current paragraph and collect all the words which are bold into the varible "runboldtext"
for run in paragraph.runs:
if run.bold:
runboldtext = runboldtext + run.text
# Now check if the value of "runboldtext" matches the entire paragraph text. If it matches, it means all the words in the current paragraph are bold and can be considered as a heading
if runboldtext == str(paragraph.text) and runboldtext != '':
print("Heading True for the paragraph: ",runboldtext)
style_of_current_paragraph = 'Heading'