How to fix broken-up text with python-docx to get free-flowing text for ebooks?

Question

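I am trying to edit a free ebook I found online into text that is easy to read on a Kindle, with headers and full paragraphs.

I am very new to Python and coding in general, so I have not really made any progress.

Every line is separated by a carriage return, so python-docx treats each line as a separate paragraph.

Basically, what needs to be done is to remove the extra spaces and the line breaks between lines, so that the text does not break up when converted to MOBI or EPUB.

The text looks like this:

Unformatted:

It should look like this:

Formatted:

Any help is welcome!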

Answer 1

I used the docx library, which is not installed by default. You can install it with pip or conda:

pip install python-docx
conda install python-docx --channel conda-forge

Once it is installed:

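from docx import Document

# Open the source file; each broken line is read as its own paragraph
doc = Document(r'path\to\file\pride_and_prejudice.docx')

# Collect the text of every paragraph (i.e. every line)
all_text = [para.text for para in doc.paragraphs]

# Join the lines with no separator; this relies on each line already
# carrying its own leading or trailing space
all_text_str = ''.join(all_text)

clean_text = all_text_str.replace('\n', '')   # Remove line breaks
# Remove runs of two spaces (so any even number of spaces disappears).
# This usually cleans up the extra spacing nicely, but you can tweak it accordingly.
clean_text = clean_text.replace('  ', '')

# Write the cleaned text to a new document as a single paragraph
document = Document()
p = document.add_paragraph(clean_text)
document.save(r'path\to\file\pride_and_prejudice_clean.docx')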
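
The snippet above writes the whole book into one paragraph and joins lines without a separator. If the source file happens to mark paragraph breaks with empty lines, a variation along the following lines might keep them. This is only a sketch under assumptions: `merge_broken_lines` and `clean_docx` are hypothetical helper names, and the "blank line means new paragraph" rule is a guess about the input, not something the question confirms. It also makes no attempt to detect headers.

```python
import re


def merge_broken_lines(lines):
    """Group hard-wrapped lines into paragraphs.

    Assumption: an empty or whitespace-only line marks a paragraph break.
    """
    paragraphs = []
    current = []
    for line in lines:
        # Collapse any run of whitespace (spaces, tabs, stray newlines) to one space
        text = re.sub(r'\s+', ' ', line).strip()
        if text:
            current.append(text)
        elif current:
            # Blank line: close the paragraph collected so far
            paragraphs.append(' '.join(current))
            current = []
    if current:
        paragraphs.append(' '.join(current))
    return paragraphs


def clean_docx(src_path, dst_path):
    """Hypothetical helper: copy src_path to dst_path with merged paragraphs."""
    # Imported here so merge_broken_lines can be tried without python-docx
    from docx import Document

    source = Document(src_path)
    merged = merge_broken_lines(para.text for para in source.paragraphs)

    target = Document()
    for text in merged:
        target.add_paragraph(text)
    target.save(dst_path)
```

It could then be called as `clean_docx(r'path\to\file\pride_and_prejudice.docx', r'path\to\file\pride_and_prejudice_clean.docx')`. Joining with a single space keeps the last word of one line from fusing with the first word of the next. If the file has no blank lines at all, everything still ends up in one paragraph, as in the original snippet.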


python

ms-word

python-docx