FileNotFoundError in Python during Arabic text analysis
I have a folder with 150 Arabic text files, and I want to find the similarity between each pair of them.
How can I do this?
I tried the approach explained here:
from sklearn.feature_extraction.text import TfidfVectorizer
documents = [open(f) for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T
But I had trouble declaring the documents, so I modified it to:
from sklearn.feature_extraction.text import TfidfVectorizer
text_files= r"C:\Users\Nujou\Desktop\Master\thesis\corpora\modified Corpora\Training setK\ST"
for f in text_files:
    documents = open(f, 'r', encoding='utf-8-sig').read()
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T
But I get this error:
documents= open(f, 'r', encoding='utf-8-sig').read()
FileNotFoundError: [Errno 2] No such file or directory: 'C'
Is there any way to fix this?
Edit:
I also tried this:
from sklearn.feature_extraction.text import TfidfVectorizer
import os
text_files= os.listdir(r"C:\Users\Nujou\Desktop\Master\thesis\corpora\modified Corpora\Training setK\ST")
documents= []
for f in text_files:
    file = open(f, 'r', 'utf-8-sig')
    documents.append(file.read())
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T
And this error occurred:
file= open(f, 'r', 'utf-8-sig')
TypeError: an integer is required (got type str)
There is no problem with your comparison of the Arabic texts; the issue is that you are not loading the documents into Python correctly. In your first attempt, iterating over a string yields its individual characters, which is why open() was called with 'C'. In your second attempt, os.listdir() returns bare filenames, not full paths, and the third positional argument of open() is buffering (an integer), so encoding must be passed as a keyword argument — hence the TypeError. If ST is a folder, you need to get a list of all the files inside it and join each name back onto the folder path:
import os
inputDir = r'your/path/here'
text_files = os.listdir(inputDir)
documents = []
for f in text_files:
    file = open(os.path.join(inputDir, f), 'r', encoding='utf-8-sig')
    documents.append(file.read())
Note that your first modified version also keeps only the last document from the loop rather than all of them, because documents is reassigned on every iteration instead of appended to a list. That, however, is a separate problem for another question.
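Putting the pieces together, here is a minimal sketch of the whole pipeline — loading every file from a folder and computing the pairwise similarity matrix. The function name pairwise_tfidf_similarity is my own; the sklearn calls are the same ones used above:

```python
import os
from sklearn.feature_extraction.text import TfidfVectorizer

def pairwise_tfidf_similarity(input_dir):
    """Load every file in input_dir and return (filenames, similarity matrix)."""
    filenames = sorted(os.listdir(input_dir))
    documents = []
    for name in filenames:
        # utf-8-sig strips the byte-order mark that Windows editors often prepend
        with open(os.path.join(input_dir, name), 'r', encoding='utf-8-sig') as fh:
            documents.append(fh.read())
    tfidf = TfidfVectorizer().fit_transform(documents)
    # rows are already L2-normalized, so the dot product is cosine similarity
    return filenames, (tfidf @ tfidf.T).toarray()
```

Entry [i][j] of the returned matrix is the cosine similarity between files i and j; the diagonal is 1.0, and identical documents also score 1.0.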