FileNotFoundError in Python during Arabic text analysis

I have a folder with 150 Arabic text files, and I want to find the similarities between them. How can I do that? I tried the approach explained here:

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [open(f) for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T

but I had trouble declaring the documents, so I modified it to:

from sklearn.feature_extraction.text import TfidfVectorizer

text_files= r"C:\Users\Nujou\Desktop\Master\thesis\corpora\modified Corpora\Training setK\ST"
for f in text_files:
    documents= open(f, 'r', encoding='utf-8-sig').read()
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T

But I got this error:

documents= open(f, 'r', encoding='utf-8-sig').read()
FileNotFoundError: [Errno 2] No such file or directory: 'C'

Is there any solution?

EDIT:

I also tried this:

from sklearn.feature_extraction.text import TfidfVectorizer

import os

text_files= os.listdir(r"C:\Users\Nujou\Desktop\Master\thesis\corpora\modified Corpora\Training setK\ST")

documents= []
for f in text_files:
    file= open(f, 'r', 'utf-8-sig')
    documents.append(file.read())
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T

which raised this error:

file= open(f, 'r', 'utf-8-sig')
TypeError: an integer is required (got type str)

Your problem isn't with comparing Arabic text; it's that the documents are never loaded into Python.

In your first attempt, `text_files` is a single path string, so `for f in text_files` iterates over its characters — that is why `open` receives `'C'`. In your edit, `os.listdir` returns bare file names without the folder path, and the third positional argument of `open` is `buffering` (an integer), not the encoding, which is what caused the `TypeError`.

If ST is a folder, get the list of files inside it and rejoin each name with the folder path:

import os
inputDir = r'your/path/here'
text_files = os.listdir(inputDir)

documents = []
for f in text_files:
    # listdir returns bare names, so join the folder path back on;
    # the with-statement also closes each file after reading
    with open(os.path.join(inputDir, f), 'r', encoding='utf-8-sig') as file:
        documents.append(file.read())

Note that the first version of your code also keeps only the last document from the loop instead of accumulating all of them, but that is a separate problem.
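Putting the pieces together, here is a minimal end-to-end sketch (the helper name `pairwise_tfidf_similarity` is my own, not from your code): it lists the folder, reads each file with `utf-8-sig` so a leading BOM is stripped, and multiplies the TF-IDF matrix by its transpose. Because `TfidfVectorizer` L2-normalises each row, that product is exactly the cosine-similarity matrix.

```python
import os
from sklearn.feature_extraction.text import TfidfVectorizer

def pairwise_tfidf_similarity(input_dir):
    """Read every file in input_dir and return the n x n cosine-similarity matrix."""
    documents = []
    for name in sorted(os.listdir(input_dir)):
        path = os.path.join(input_dir, name)
        if os.path.isfile(path):  # skip sub-folders
            # utf-8-sig also handles files saved with a leading BOM
            with open(path, 'r', encoding='utf-8-sig') as fh:
                documents.append(fh.read())
    tfidf = TfidfVectorizer().fit_transform(documents)
    # rows are already L2-normalised, so X * X.T yields cosine similarities
    return (tfidf * tfidf.T).toarray()
```

With your 150 files this returns a 150x150 matrix: the diagonal entries are 1.0 (each document compared with itself), and entry `[i, j]` is the similarity between documents `i` and `j` in sorted-filename order.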