使用 python-docx 阅读口音

Question

我想使用 python-docx 获取一些 docx 文件的纯文本，但由于文本是用西班牙语编写的，所以我很难处理重音。

我正在使用 this answer 阅读文本：

def getText(filename):
  doc = docx.Document(filename)
  fullText = []
  for para in doc.paragraphs:
      fullText.append(para.text('utf-8'))
  return '\n'.join(fullText)

其中 returns 是这样的：

n\xc3\xbamero //should be número

有什么方法可以使文本具有正确的重音符号？

当我尝试使用以下方法将此文本写入文件时：

file = open("/mnt/c/Users/lulas/Desktop/inSpanish/txt/course1.txt","w")
file.write(text)

我收到这个错误：

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 27: ordinal not in range(128)

这是由于口音 read/encoded。

Answer 1

没有文字，只有经过编码的文字。

您正在创建一个文本文件。文本文件是用字符编码编写的。该错误表明您正在写入的文本包含您的字符编码不支持的字符。

因此，您要么选择不同的编码，要么不写入这些字符。请记住 1) reader 必须知道文件使用哪种编码，以便必须传达 and/or 达成一致。 2) 原始角色可能非常有价值，因此删除或替换它们可能是一个糟糕的选择。

由于源文件 (docx) 使用 Unicode 字符集，因此 Unicode 编码可能是最佳选择。对于存储和流式传输 Unicode，UTF-8 是最常见的编码。所以，

file = open("/mnt/c/Users/lulas/Desktop/inSpanish/txt/course1.txt","w", encoding="utf-8")
file.write(text)

我不认为问题出在阅读上。 n\xc3\xbamero 是用 UTF-8 编码时 número 的表示。无论向您展示什么，它都只是想成为 "helpful"。

使用 python-docx 阅读口音

Reading accents with python-docx

python

docx

character-encoding

python-docx