Using textract to get text from pptx and docx without tags

Question

I use the following code to get a string from a docx or pptx file. (Because textract does not handle non-ASCII characters correctly, I use the solution described here):

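import textract as txt

# Raw string so the backslashes in the Windows path are not read as escape sequences
text = txt.process(r"D:\Corpus\Exposee.pptx")
# process() returns bytes; decode them to get a str
text = text.decode("utf8")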

Then I evaluate text and get a string like this:

'Syntaktische Besonderheiten \n\ndes Maschinellen Verstehens \n\nder Deutschen Sprache \n\nin der Multilingualen Perspektive\n\nMarvin Teller\n\nForschungsfrage\n\nW\n\nelche\n\n \n\nEigenschaften\n\n \n\n\n\n\tder \n\nsyntaktischen\n\n \n\nStruktur\n\n der \n\n

(shortened)

I want the string without tags such as \n and \t. How can I do that?

Sorry in advance for any duplication/naiveness.

Answer 1

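From the comments: the text you see has the form in which it was extracted from the file. The \n characters mark paragraph breaks; if you print the string, you can see the paragraphs. To get rid of them, run text = text.replace("\n", ""), which replaces every "\n" with the empty string "". The same call with "\t" removes the tabs.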
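As a minimal sketch of the replacement described above, the snippet below works on a short excerpt of the string from the question, so it does not need textract or the pptx file. The helper names strip_newlines_and_tabs and collapse_whitespace are hypothetical, chosen for illustration only. The second helper is an alternative not mentioned in the answer: it replaces each run of whitespace with a single space, which may be preferable when removing "\n" outright would fuse neighbouring words.

```python
def strip_newlines_and_tabs(text: str) -> str:
    # The approach from the answer: delete every newline and tab.
    # Words separated only by these characters end up fused together.
    return text.replace("\n", "").replace("\t", "")


def collapse_whitespace(text: str) -> str:
    # Alternative (an assumption, not from the answer): str.split() with no
    # arguments splits on any run of whitespace and drops empty pieces, so
    # joining with " " leaves exactly one space between words.
    return " ".join(text.split())


# Excerpt of the extracted text shown in the question.
sample = (
    "in der Multilingualen Perspektive\n\nMarvin Teller\n\n"
    "Forschungsfrage\n\n\tder \n\nsyntaktischen"
)

print(strip_newlines_and_tabs(sample))
print(collapse_whitespace(sample))
```

Neither variant is perfect for this input: the extraction split "Welche" into "W\n\nelche", which plain removal happens to rejoin, while whitespace collapsing would leave it as "W elche". Which one fits depends on whether fused words or stray spaces are easier to live with in the corpus.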

Tags: python, powerpoint, text-extraction