My vectorize project isn't working. How would I go about fixing this?
class Document:
    def __init__(self, doc_id):
        # create a new document with its ID
        self.id = doc_id
        # create an empty dictionary
        # that will hold the term frequency (TF) counts
        self.tfs = {}

    def tokenization(self, text):
        # split a title into words,
        # using space " " as delimiter
        words = text.lower().split(" ")
        for word in words:
            # for each word in the list
            if word in self.tfs:
                # if it has been counted in the TF dictionary
                # add 1 to the count
                self.tfs[word] = self.tfs[word] + 1
            else:
                # if it has not been counted,
                # initialize its TF with 1
                self.tfs[word] = 1

    def save_dictionary(diction_data, file_path_name):
        f = open("./textfiles", "w+")
        for key in diction_data:
            # Separate the key from the frequency with a space and
            # add a newline to the end of each key value pair
            f.write(key + " " + str(diction_data[key]) + "\n")
        f.close()

    def vectorize(data_path):
        Documents = []
        for i in range(1, 21):
            file_name = "./textfiles/"+ i + ".txt"
        # create a new document with an ID
        doc = Document(i+1)
        # Read the files
        f = open(file_name)
        print(f.read())
        # compute the term frequencies
        # read in the file's contents
        doc.tokenization(f.read())
        # add the documents to the lists
        Documents.append(doc)
        save_dictionary(doc.tfs, "tf_" + str(doc.id) + ".txt")
        DFS = {}
        for doc in Documents:
            for word in doc.tfs:
                DFS[word] = DFS.get(word,0) + 1
        save_dictionary(doc.DFS, "DFS_" + str(doc.id) + ".txt")

vectorize("./textfiles")
The above is my code, but it doesn't work properly. I added a nested loop over every word in each document's dictionary to do the following: if a word does not appear in the DF dictionary, add it to the DF dictionary; if it is already in the DF dictionary, increase its DF value by adding 1 to it. Then, after all the files have been processed, I call the save_dictionary() function again to save the DF dictionary to a file named df.txt in the same path as the input text files, and then vectorize.
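In code form, the DF pass I have in mind looks roughly like this (a sketch of my intent, not the exact code above):

DFS = {}
for doc in Documents:
    for word in doc.tfs:
        # a word absent from DFS starts at 0, so new words get 1
        # and already-seen words are incremented by 1
        DFS[word] = DFS.get(word, 0) + 1

# after all files are processed, save the DF dictionary
# next to the input text files
save_dictionary(DFS, "./textfiles/df.txt")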
When I run the code, nothing happens, so I must be doing something wrong somewhere. Any help would be greatly appreciated.
As mentioned in the comments, your indentation is wrong in several places. Please fix that first. For example, in vectorize(), i is referenced in the doc assignment, but that assignment sits outside the local scope of the for loop in which i is defined.
Also, it would be helpful to separate your logic code from the script part to make debugging easier.
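For example, a minimal sketch of that separation (the class and function bodies are elided; only the structure matters here):

# logic code: importable without side effects
class Document:
    ...

def vectorize(data_path):
    ...

# script part: runs only when the file is executed directly
if __name__ == "__main__":
    vectorize("./textfiles")

This way you can import Document and vectorize into an interactive session or a test file and poke at them without triggering the whole run.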
Update:
save_dictionary and vectorize() either need self as the first function argument to become part of the Document class, or they need the @staticmethod decorator. Also, i is still being referenced outside of the for loop, where it only applies.
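For the save_dictionary point, here is a rough sketch of the @staticmethod option (I'm also assuming you meant to write to the file_path_name argument rather than the hard-coded "./textfiles" path):

class Document:
    @staticmethod
    def save_dictionary(diction_data, file_path_name):
        # a static method receives no self/instance argument
        with open(file_path_name, "w+") as f:
            for key in diction_data:
                # key and frequency separated by a space, one pair per line
                f.write(key + " " + str(diction_data[key]) + "\n")

It would then be called through the class, e.g. Document.save_dictionary(doc.tfs, "tf_" + str(doc.id) + ".txt").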
As for the changes to vectorize, I suggest fixing the indentation around the for loop and using with to create a context manager, so the file is opened and closed properly and easily:
def vectorize(self, data_path):
    Documents = []
    for i in range(1, 21):
        file_name = "./textfiles/" + str(i) + ".txt"
        # create a new document with an ID
        doc = Document(i+1)
        # read the files
        with open(file_name, 'r') as f:
            text = f.read()
        # compute the term frequencies
        # read in the file's contents
        doc.tokenization(text)
        # add the documents to the lists
        Documents.append(doc)
The context manager automatically closes the file when the with context/indentation is exited.
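For context, the with statement above behaves roughly like this manual version (a sketch of the equivalence, not code you need to add):

f = open(file_name, 'r')
try:
    text = f.read()
finally:
    # runs whether or not read() raises,
    # just like leaving the with block
    f.close()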