How can encoded text be converted back to the main text (without the special characters created by encoding)?

Question

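I am extracting text from a series of PDF files for topic modeling. After extracting the text from each PDF file, I want to save it in a .txt or .doc file. While doing this I got an error telling me to add .encode('utf-8') in order to save the extracted text in a .txt file, so I added txt = str(txt.encode('utf-8')). The problem is reading the .txt files back: when I read them, they contain special characters because of UTF-8, and I don't know how to get the main text without these characters. I tried decoding, but it didn't help.

I also tried another approach to avoid saving to .txt: I wanted to store the extracted text in a data frame, but I found that only the first few pages were saved in the data frame!

I would appreciate it if you could share a solution for reading the .txt files and removing the encoding-related ('utf-8') characters, and also explain how I can save the extracted text in a data frame.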

import pdfplumber
import pandas as pd
import codecs

txt = ''

with pdfplumber.open(r'C:\Users\thmagrdPaperLDA\A1.pdf') as pdf:
    pages = pdf.pages
    for i, pg in enumerate(pages):
        txt += pages[i].extract_text()

print (txt)

data = {'text': [txt]}
df = pd.DataFrame(data)


####write in .txt file
text_file = open("Test.txt", "wt")
txt = str(txt.encode('utf-8'))
n = text_file.write(txt)
text_file.close()

####read from .txt file
with codecs.open('Test.txt', 'r', 'utf-8') as f:
    for line in f:
        print (line)

Answer 1

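You are writing the file incorrectly. Rather than encoding the text yourself, declare the encoding when you open the file and then write the text without encoding it; Python will encode it for you automatically.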
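To see where the special characters come from, here is a minimal sketch of what the line txt = str(txt.encode('utf-8')) does. The sample string and the helper name wrap_like_question are only illustrations, not part of your code: calling str() on a bytes object returns its printable representation, including the b'...' wrapper and the \x escapes, and that representation is what ends up in the file.

```python
def wrap_like_question(text):
    """Reproduce the question's step: encode, then call str() on the bytes.

    Hypothetical helper for illustration only.
    """
    encoded = text.encode("utf-8")  # a bytes object
    return str(encoded)             # the repr of the bytes, not the original text


sample = "naïve café"               # stand-in for text returned by extract_text()
wrapped = wrap_like_question(sample)

print(wrapped)                      # prints: b'na\xc3\xafve caf\xc3\xa9'
print(wrapped == sample)            # prints: False
```

So the file never contained the real text, which is probably why decoding after reading did not help.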

The write step should be:


####write in .txt file
with open("Test.txt", "wt", encoding='utf-8') as text_file:
    n = text_file.write(txt)

Unless you are using Python 2, you don't need codecs to open encoded files. You can likewise declare the encoding in the open function:

with open("Test.txt", "rt", encoding='utf-8') as f:
    for line in f:
        print(line)
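Putting the two pieces together, the following is a rough sketch of a full round trip, plus one possible way to rescue files that were already written with str(txt.encode('utf-8')). The helper names (write_text, read_text, recover_wrapped) are made up for this example, and the recovery step assumes the file contains exactly the b'...' representation of the bytes and nothing else.

```python
import ast
import os
import tempfile


def write_text(path, text):
    """Write a str to disk; open() does the UTF-8 encoding."""
    with open(path, "wt", encoding="utf-8") as f:
        f.write(text)


def read_text(path):
    """Read the file back as a str; open() does the UTF-8 decoding."""
    with open(path, "rt", encoding="utf-8") as f:
        return f.read()


def recover_wrapped(content):
    """Undo str(text.encode('utf-8')) for content that looks like b'...'.

    Assumption: `content` is exactly the repr of a bytes object.
    """
    data = ast.literal_eval(content.strip())  # parse the b'...' literal safely
    if not isinstance(data, bytes):
        raise ValueError("content is not a bytes literal")
    return data.decode("utf-8")


sample = "naïve café\nsecond line"  # stand-in for the extracted PDF text

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "Test.txt")
    write_text(path, sample)
    restored = read_text(path)

print(restored == sample)                        # prints: True
print(recover_wrapped(str("café".encode("utf-8"))))  # prints: café
```

If the old files are easy to regenerate from the PDFs, rewriting them with the encoding argument is simpler than recovering them.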

python

encoding

nlp

utf-8