Colab OSError: [Errno 36] File name too long when reading a docx2text file

Question

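I'm studying NLP techniques, and although I have some experience with .txt files, working with .docx has been a hassle. I'm trying to run regular expressions on strings, and because my source is a Word document, this is my approach: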

I use textract to convert the docx to text, then decode the bytes into a string:

import textract
my_text = textract.process("1337.docx")
my_text = my_text.decode("utf-8")

I read the file:

def load_doc(filename):

  # open the file as read only
  file = open(filename, 'r')

  # read all text
  text = file.read()

  # close the file
  file.close()

  return text

Then I try some regex processing, such as removing all digits, and when I run this in the main program:

def regextest(doc):

...

...
text = load_doc(my_text)
tokens = regextest(text)
print(tokens)

I get this exception:

OSError: [Errno 36] File name too long: Are you buying a Tesla?\n\n\n\n - I believe the pricing is...(and more text from the file)

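I understand that I'm converting my docx file to text, and that when I then read the "filename", it is actually the entire text. How can I save the file and make this work? How would you approach this problem?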

Answer 1

You appear to be passing the file's contents, my_text, as the filename argument of load_doc, which is what causes the error.

I think you meant to pass an actual file name as the argument, probably '1337.docx', rather than the contents of that file.
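
As a rough sketch of two ways this could be untangled: since my_text already holds the decoded text, it can likely go straight to the regex step without load_doc; alternatively, the text can be written to a .txt file first and load_doc called with that path. The names remove_digits and txt_path are hypothetical, remove_digits only stands in for the elided regextest, and raw_bytes stands in for the bytes that textract.process("1337.docx") would return.

```python
import re
import tempfile
from pathlib import Path

# Assumption: stand-in for the bytes returned by textract.process("1337.docx").
raw_bytes = "Are you buying a Tesla in 2024?\n\n - I believe the pricing is 100% fair.".encode("utf-8")
my_text = raw_bytes.decode("utf-8")


def remove_digits(text):
    """Hypothetical stand-in for regextest: drop digits, split on whitespace."""
    return re.sub(r"\d+", "", text).split()


def load_doc(filename):
    """Read a text file given its path (not its contents)."""
    with open(filename, "r", encoding="utf-8") as f:
        return f.read()


# Option 1: the text is already in memory, so skip load_doc entirely.
tokens = remove_digits(my_text)

# Option 2: save the text to a .txt file, then load it by its path.
txt_path = Path(tempfile.mkdtemp()) / "1337.txt"
txt_path.write_text(my_text, encoding="utf-8")
tokens_from_file = remove_digits(load_doc(txt_path))

print(tokens)
```

Either way, load_doc only ever receives a path, so open() is never asked to treat the whole document as a file name.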

nlp

nltk

python-3.x

python-re