Python vectorize function and calling save_dictionary issue

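I am creating a vectorize function that does the following:

It takes a string argument, the path (folder) where the text data files are located, processes all data files under that path, and produces TF and DF statistics.

I fixed the code from my last submission and would like to know how to call the save_dictionary() function to save each document's dictionary of TF (term frequency) counts to a file. The file should be named tf_DOCID.txt and be written to the same path.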

class Document: 
    def __init__(self, doc_id):
        # create a new document with its ID
        self.id = doc_id
        # create an empty dictionary 
        # that will hold the term frequency (TF) counts
        self.tfs = {}

    def tokenization(self, text):
        # split a title into words, 
        # using space " " as delimiter
        words = text.lower().split(" ")
        for word in words: 
           # for each word in the list
           if word in self.tfs: 
               # if it has been counted in the TF dictionary
               # add 1 to the count
               self.tfs[word] = self.tfs[word] + 1
           else:
               # if it has not been counted, 
               # initialize its TF with 1
               self.tfs[word] = 1

def save_dictionary(diction_data, file_path_name):
    # print the key-values pair in a dictionary
    f = open("./textfiles", "w+")
    for key in diction_data: 
        f.print(key, diction_data[key])
        f.close()

def vectorize(data_path):
    Document = []
    for i in range(1, 21):
        file_name = "./textfiles/"+ i + ".txt"
        # create a new document with an ID
    Document = Document(i+1)
        #Read the files
    f = open(Document)
    print(f.read())
        # compute the term frequencies
    Document.tokenization(file_name)
        # add the documents to the lists
    Documents.append(Document)

Looking at the vectorize function:
1) Documents is not defined
2) I assume you wanted to create an empty list: documents = []
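
To make these two points concrete, here is a minimal sketch. The stripped-down Document class and the function names broken and fixed are illustrative assumptions, not the asker's full code. It shows what happens when a list is bound to the same name as the class, and how a separately named list avoids the problem:

```python
class Document:
    def __init__(self, doc_id):
        self.id = doc_id


def broken():
    # Binding a list to the name Document hides the class inside this
    # function, so the later call Document(1) tries to call a list.
    Document = []
    return Document(1)


def fixed():
    # A differently named list leaves the class name usable.
    documents = []
    documents.append(Document(1))
    return documents


try:
    broken()
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
print([d.id for d in fixed()])
```

The first print shows the TypeError message raised by calling a list; the second shows the ID of the one document stored by fixed().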

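I think there are a few gaps in your Python. I did not go into the class implementation, but here are some comments: note that you pass only the file path to tokenization, while the method treats its argument as the file's text. You first need to open the file at that path and read its contents.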

def vectorize(data_path):
    documents = []  # an empty list; no need to declare its type
    for i in range(1, 21):
        # i is an int, so format it into the name (str + int raises TypeError)
        file_name = f"{data_path}/{i}.txt"
        # create a new document, using the file number as its ID
        doc = Document(i)
        # read the file's text
        with open(file_name) as f:
            text = f.read()
        # compute the term frequencies from the text, not from the path
        doc.tokenization(text)
        # add the current document to the list
        documents.append(doc)
    return documents
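
The difference between passing a path and passing the file's contents can be seen in a small self-contained sketch. The helper name count_terms and the temporary sample file are assumptions made for illustration only. The sketch also uses split() with no argument, which splits on any whitespace, whereas split(" ") leaves newline characters attached to words.

```python
import os
import tempfile


def count_terms(text):
    # Count term frequencies; split() with no argument splits on any
    # whitespace (spaces, tabs, newlines) and drops empty strings.
    tfs = {}
    for word in text.lower().split():
        tfs[word] = tfs.get(word, 0) + 1
    return tfs


# Write a small sample file so the example runs anywhere.
with tempfile.TemporaryDirectory() as folder:
    file_name = os.path.join(folder, "1.txt")
    with open(file_name, "w") as f:
        f.write("The cat saw\nthe other cat")

    # Wrong: this counts the "words" of the path string itself.
    from_path = count_terms(file_name)

    # Right: open the file, read its text, then count.
    with open(file_name) as f:
        from_text = count_terms(f.read())

print(from_text)
```

Here from_path contains pieces of the file path rather than words from the file, which is what happens when the path is handed straight to tokenization.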

Oh, and please name the class Document, as the PEP 8 convention goes (CapWords for class names). You can read more in PEP 8.

Regarding the save_dictionary function: I would make it a method of the Document class, and use json to save the dictionary to a file:

import json
def save_dictionary(self):
    # write the TF dictionary as JSON to tf_<id>.txt ('somepath' is a placeholder)
    with open(f'somepath/tf_{self.id}.txt', 'w') as f:
        json.dump(self.tfs, f)
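
Putting the pieces together, the following is one possible end-to-end sketch, not a definitive implementation. It assumes save_dictionary takes the target folder as a parameter (a hypothetical signature), that vectorize receives the document IDs to process (also an assumption, in place of the fixed range), and that DF means the number of documents containing a term. The sample texts and the helper document_frequencies are made up for the demonstration.

```python
import json
import os
import tempfile


class Document:
    def __init__(self, doc_id):
        self.id = doc_id
        self.tfs = {}

    def tokenization(self, text):
        # Count each whitespace-separated, lowercased word.
        for word in text.lower().split():
            self.tfs[word] = self.tfs.get(word, 0) + 1

    def save_dictionary(self, folder):
        # Hypothetical signature: folder is where tf_<id>.txt is written.
        file_path_name = os.path.join(folder, f"tf_{self.id}.txt")
        with open(file_path_name, "w") as f:
            json.dump(self.tfs, f)
        return file_path_name


def vectorize(data_path, doc_ids):
    documents = []
    for i in doc_ids:
        with open(os.path.join(data_path, f"{i}.txt")) as f:
            text = f.read()
        doc = Document(i)
        doc.tokenization(text)
        # Save the TF dictionary next to the data files.
        doc.save_dictionary(data_path)
        documents.append(doc)
    return documents


def document_frequencies(documents):
    # DF: number of documents in which each term appears at least once.
    dfs = {}
    for doc in documents:
        for term in doc.tfs:
            dfs[term] = dfs.get(term, 0) + 1
    return dfs


# Demonstration with two tiny sample files in a temporary folder.
with tempfile.TemporaryDirectory() as data_path:
    samples = {1: "red fish blue fish", 2: "red bird"}
    for i, text in samples.items():
        with open(os.path.join(data_path, f"{i}.txt"), "w") as f:
            f.write(text)

    docs = vectorize(data_path, samples.keys())
    dfs = document_frequencies(docs)

    # Read one saved file back to confirm the round trip.
    with open(os.path.join(data_path, "tf_1.txt")) as f:
        saved = json.load(f)

print(saved)
print(dfs)
```

Calling save_dictionary inside the loop keeps each tf_DOCID.txt in the same folder as its source file. Because the file is JSON, json.load reads it back as a dictionary.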