My vectorize project isn't working. How would I go about fixing this?
class Document:
    def __init__(self, doc_id):
        # create a new document with its ID
        self.id = doc_id
        # create an empty dictionary
        # that will hold the term frequency (TF) counts
        self.tfs = {}

    def tokenization(self, text):
        # split a title into words,
        # using space " " as delimiter
        words = text.lower().split(" ")
        for word in words:
            # for each word in the list
            if word in self.tfs:
                # if it has been counted in the TF dictionary
                # add 1 to the count
                self.tfs[word] = self.tfs[word] + 1
            else:
                # if it has not been counted,
                # initialize its TF with 1
                self.tfs[word] = 1

    def save_dictionary(diction_data, file_path_name):
        f = open("./textfiles", "w+")
        for key in diction_data:
            # Separate the key from the frequency with a space and
            # add a newline to the end of each key value pair
            f.write(key + " " + str(diction_data[key]) + "\n")
        f.close()

    def vectorize(data_path):
        Documents = []
        for i in range(1, 21):
            file_name = "./textfiles/"+ i + ".txt"
        # create a new document with an ID
        doc = Document(i+1)
        # Read the files
        f = open(file_name)
        print(f.read())
        # compute the term frequencies
        # read in the file's contents
        doc.tokenization(f.read())
        # add the documents to the lists
        Documents.append(doc)
        save_dictionary(doc.tfs, "tf_" + str(doc.id) + ".txt")
        DFS = {}
        for doc in Documents:
            for word in doc.tfs:
                DFS[word] = DFS.get(word,0) + 1
        save_dictionary(doc.DFS, "DFS_" + str(doc.id) + ".txt")

vectorize("./textfiles")
The above is my code, but it doesn't work properly. I added a nested loop over every word in each document's dictionary to do the following: if a word does not appear in the DF dictionary, add it to the DF dictionary; if it is already in the DF dictionary, increase its DF value by adding 1 to it. Then, after all the files have been processed, I call the save_dictionary() function again to save the DF dictionary to a file named df.txt in the same path as the input text files, and then vectorize.
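In code form, the DF pass I have in mind looks roughly like this (a sketch of my intent, not the exact code above):

DFS = {}
for doc in Documents:
    for word in doc.tfs:
        # a word absent from DFS starts at 0, so new words get 1
        # and already-seen words are incremented by 1
        DFS[word] = DFS.get(word, 0) + 1

# after all files are processed, save the DF dictionary
# next to the input text files
save_dictionary(DFS, "./textfiles/df.txt")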
When I run the code, nothing happens, so I must be doing something wrong somewhere. Any help would be greatly appreciated.
As mentioned in the comments, your indentation is wrong in several places. Please fix that first. For example, in vectorize(), i is referenced in the doc assignment, but that assignment sits outside the local scope of the for loop in which i is defined.
Also, it would be helpful to separate your logic code from the script part to make debugging easier.
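For example, a minimal sketch of that separation (the class and function bodies are elided; only the structure matters here):

# logic code: importable without side effects
class Document:
    ...

def vectorize(data_path):
    ...

# script part: runs only when the file is executed directly
if __name__ == "__main__":
    vectorize("./textfiles")

This way you can import Document and vectorize into an interactive session or a test file and poke at them without triggering the whole run.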
Update:
save_dictionary and vectorize() either need self as the first function argument to become part of the Document class, or they need the @staticmethod decorator. Also, i is still being referenced outside of the for loop, where it only applies.
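For the save_dictionary point, here is a rough sketch of the @staticmethod option (I'm also assuming you meant to write to the file_path_name argument rather than the hard-coded "./textfiles" path):

class Document:
    @staticmethod
    def save_dictionary(diction_data, file_path_name):
        # a static method receives no self/instance argument
        with open(file_path_name, "w+") as f:
            for key in diction_data:
                # key and frequency separated by a space, one pair per line
                f.write(key + " " + str(diction_data[key]) + "\n")

It would then be called through the class, e.g. Document.save_dictionary(doc.tfs, "tf_" + str(doc.id) + ".txt").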
As for the changes to vectorize, I suggest fixing the indentation around the for loop and using with to create a context manager, so the file is opened and closed properly and easily:
def vectorize(self, data_path):
    Documents = []
    for i in range(1, 21):
        file_name = "./textfiles/" + str(i) + ".txt"
        # create a new document with an ID
        doc = Document(i+1)
        # read the files
        with open(file_name, 'r') as f:
            text = f.read()
        # compute the term frequencies
        # read in the file's contents
        doc.tokenization(text)
        # add the documents to the lists
        Documents.append(doc)
The context manager automatically closes the file when the with context/indentation is exited.
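For context, the with statement above behaves roughly like this manual version (a sketch of the equivalence, not code you need to add):

f = open(file_name, 'r')
try:
    text = f.read()
finally:
    # runs whether or not read() raises,
    # just like leaving the with block
    f.close()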