Splitting a docx by headings into separate files in Python
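I want to write a program that grabs my docx files, iterates through them, and splits each file into multiple separate files based on headings. Each docx contains several articles, and each article has a 'Heading 1' followed by text.
So if my original file1.docx has 4 articles, I want it split into 4 separate files, each with its heading and text.
I have got as far as iterating over all the files in the path where I keep the .docx files, and I can read the headings and the text separately, but I can't figure out how to combine them and split the result into separate files, each with a heading and its text. I am using the python-docx library.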
    import glob
    from docx import Document

    headings = []
    texts = []

    def iter_headings(paragraphs):
        for paragraph in paragraphs:
            if paragraph.style.name.startswith('Heading'):
                yield paragraph

    def iter_text(paragraphs):
        for paragraph in paragraphs:
            if paragraph.style.name.startswith('Normal'):
                yield paragraph

    for name in glob.glob('/*.docx'):
        document = Document(name)
        for heading in iter_headings(document.paragraphs):
            headings.append(heading.text)
        for paragraph in iter_text(document.paragraphs):
            texts.append(paragraph.text)
        print(texts)
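How do I extract the heading and body text of each article?
This is the XML view that python-docx gives me. The red braces mark what I want to extract from each file.
https://user-images.githubusercontent.com/17858776/51575980-4dcd0200-1eac-11e9-95a8-f643f87b1f40.png
I am open to any alternative suggestions for achieving this with a different approach, or to hearing whether it would be easier to do with PDF files.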
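I think using iterators is a sound approach, but I would be inclined to divide them up differently. At the top level you could have:
    for paragraphs in iterate_document_sections(document):
        create_document_from_paragraphs(paragraphs)
Then iterate_document_sections() would look something like: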
    def iterate_document_sections(document):
        """Generate a sequence of paragraphs for each headed section in document.

        Each generated sequence has a heading paragraph in its first position,
        followed by one or more body paragraphs.
        """
        paragraphs = [document.paragraphs[0]]
        for paragraph in document.paragraphs[1:]:
            if is_heading(paragraph):
                yield paragraphs
                paragraphs = [paragraph]
                continue
            paragraphs.append(paragraph)
        yield paragraphs
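Something like this, combined with the other parts of your code, should give you a workable starting point. You will need to implement is_heading() and create_document_from_paragraphs().
Note that the term "section" is used here in its general publishing sense, meaning a (section) heading together with its subordinate paragraphs. It does not refer to a Word document section object (like those in document.sections).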
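One possible way to fill in the two missing helpers is sketched below. It is only an illustration under some assumptions: every article starts with a paragraph styled 'Heading 1', the output path argument is a hypothetical addition to create_document_from_paragraphs(), and only each paragraph's text and style name are copied (run-level formatting, tables and images are not carried over). The style names also have to exist in the template that Document() starts from, which is true for 'Heading 1' and 'Normal' but may not be for custom styles. This variant of iterate_document_sections() takes the paragraph list directly and tolerates an empty document.

```python
def is_heading(paragraph, style_name="Heading 1"):
    """Return True when the paragraph uses the given heading style."""
    return paragraph.style.name == style_name


def iterate_document_sections(paragraphs):
    """Yield one list of paragraphs per headed section.

    Any paragraphs before the first heading are yielded as their own group.
    """
    section = []
    for paragraph in paragraphs:
        # A heading closes the section collected so far and starts a new one.
        if is_heading(paragraph) and section:
            yield section
            section = []
        section.append(paragraph)
    if section:
        yield section


def create_document_from_paragraphs(paragraphs, path):
    """Write the paragraphs to a new .docx file (text and style name only)."""
    # python-docx is imported here so the helpers above stay usable without it.
    from docx import Document

    new_doc = Document()
    for paragraph in paragraphs:
        new_doc.add_paragraph(paragraph.text, style=paragraph.style.name)
    new_doc.save(path)
```

With these in place, the top-level loop could enumerate the sections of `document.paragraphs` and pass a file name such as `f"article_{i}.docx"` as the second argument.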
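In fact, the solution above only works if the document contains no elements other than paragraphs (tables, for example).
Another possible solution is to iterate not only over the paragraphs but over all child XML elements of the document body. Once you have found the start and end elements of a "sub-document" (in the example, paragraphs with a heading), you delete the other XML elements that do not belong to that part, which cuts off all the rest of the document content. This way you keep all styles, text, tables, and other document elements and formatting.
It is not an elegant solution, and it means you have to keep a temporary copy of the complete source document in memory.
Here is my code: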
    import tempfile
    from typing import Generator, Tuple, Union

    from docx import Document
    from docx.document import Document as DocType
    from docx.oxml.table import CT_Tbl
    from docx.oxml.text.paragraph import CT_P
    from docx.oxml.xmlchemy import BaseOxmlElement
    from docx.text.paragraph import Paragraph

    def iterparts(doc_path:str, skip_first=True, bias:int=0) -> Generator[Tuple[int,DocType],None,None]:
        """Iterate over sub-documents by splitting source document into parts

        Split into parts by copying source document and cutting off irrelevant
        data.

        Args:
            doc_path (str): path to source *docx* file
            skip_first (bool, optional): skip first split point and wait for
                second occurrence. Defaults to True.
            bias (int, optional): split point bias. Defaults to 0.

        Yields:
            Generator[Tuple[int,DocType],None,None]: first element of each tuple
                indicates the number of a sub-document, if number is 0
                then there are no sub-documents
        """
        doc = Document(doc_path)
        counter = 0
        while doc:
            split_elem_idx = -1
            doc_body = doc.element.body
            cutted = [doc, None]
            for idx, elem in enumerate(doc_body.iterchildren()):
                if is_split_point(elem):
                    if split_elem_idx == -1 and skip_first:
                        split_elem_idx = idx
                    else:
                        # bias=-1 cuts at idx-1, keeping the previous paragraph
                        # together with the new part
                        cutted = split(doc, idx+bias)
                        counter += 1
                        break
            yield (counter, cutted[0])
            doc = cutted[1]

    def is_split_point(element:BaseOxmlElement) -> bool:
        """Split criteria

        Args:
            element (BaseOxmlElement): oxml element

        Returns:
            bool: whether the *element* is the beginning of a new sub-document
        """
        if isinstance(element, CT_P):
            p = Paragraph(element, element.getparent())
            return p.text.startswith("Some text")
        return False

    def split(doc:DocType, cut_idx:int) -> Tuple[DocType,DocType]:
        """Splitting into parts by copying source document and cutting off
        irrelevant data.

        Args:
            doc (DocType): document to split
            cut_idx (int): index of the first body element of the second part

        Returns:
            Tuple[DocType,DocType]: the part before the cut and the part from
                the cut onwards
        """
        tmpdocfile = write_tmp_doc(doc)
        # the original document object becomes the second part
        second_part = doc
        second_elems = list(second_part.element.body.iterchildren())
        for i in range(0, cut_idx):
            remove_element(second_elems[i])
        # the saved copy becomes the first part
        first_part = Document(tmpdocfile)
        first_elems = list(first_part.element.body.iterchildren())
        for i in range(cut_idx, len(first_elems)):
            remove_element(first_elems[i])
        tmpdocfile.close()
        return (first_part, second_part)

    def remove_element(elem: Union[CT_P,CT_Tbl]):
        elem.getparent().remove(elem)

    def write_tmp_doc(doc:DocType):
        tmp = tempfile.TemporaryFile()
        doc.save(tmp)
        return tmp
Note that you should define the is_split_point function according to your own split criteria.
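For the heading-based split asked about in the question, a split criterion might look like the sketch below. It reads the paragraph's w:pStyle value directly from the XML instead of going through Paragraph.style, because the Paragraph built in is_split_point() above has a raw XML element as its parent and, as far as I can tell, cannot resolve styles. The style id "Heading1" is an assumption (it is the usual id behind the 'Heading 1' style, but a given file may use a different one), and paragraph_style_id() and save_parts() are hypothetical helper names. save_parts() numbers the output files itself because, from reading iterparts(), the yielded counter does not appear to advance for the final part.

```python
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraph_style_id(element):
    """Return the w:pStyle value of a w:p element, or None if there is none."""
    if element.tag != f"{{{W_NS}}}p":
        return None  # tables and other body children are never split points
    p_style = element.find(f"{{{W_NS}}}pPr/{{{W_NS}}}pStyle")
    if p_style is None:
        return None
    return p_style.get(f"{{{W_NS}}}val")


def is_split_point(element, heading_style_id="Heading1"):
    """Treat every paragraph with the given style id as the start of a part."""
    return paragraph_style_id(element) == heading_style_id


def save_parts(parts, out_pattern="part_{:03d}.docx"):
    """Save each (counter, document) pair to its own file and return the paths."""
    paths = []
    for number, (_, part) in enumerate(parts, start=1):
        path = out_pattern.format(number)
        part.save(path)
        paths.append(path)
    return paths
```

Used together with the code above, `save_parts(iterparts("file1.docx"))` would write one file per article, with any content before the first heading ending up in the first file.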