Split data into multiple files: how to handle (unknown number of) multiple connections

Question

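I want to split a (in real life: huge) file into multiple files, e.g. by the second column in the data. That is, in the example below I need the files 431.csv and rr1.csv. My main idea is to open a new connection for writing if one is not already open, keep a record of the open connections in the dict files_dict, and then loop over that dict and close them all at the end.

I am stuck on how to refer to these connections line by line.

In real life, the number and values of these file names (second column) are not known in advance.

I found some inspiration here: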

write multiple files at a time

python inserting variable string as file name

Toy data contents of data_in:

123,431,t
43,rr1,3
13,rr1,43
123,rr1,4

My current naive pseudocode:

files_dict = dict() #dict of file names

with open(data_in) as fi:
    for line in fi:
        x = line.split(',')[1]

        if x not in files_dict:
            fo = x + '.csv'
            files_dict[x] = fo

            '''
            open files_dict[x]
            write line to files_dict[x]

            '''
    else:
        '''
        write line to files_dict[x]
        '''

for fo in files_dict.fos:
    fo.close()

Answer 1

Put the file objects themselves, rather than the file names, into the dictionary.

files_dict = {}

with open(data_in) as fi:
    for line in fi:
        x = line.split(',')[1]

        if x not in files_dict:
            fo = open(x + '.csv', "w")
            files_dict[x] = fo
        else:
            fo = files_dict[x]

        # write the whole input line, not just the key from the second column
        fo.write(line)

for fo in files_dict.values():
    fo.close()

Answer 2

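Your idea is right, but you should store file objects rather than file names in the dictionary, and you don't need the else block (which, in any case, should be aligned with the if rather than the for):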

files_dict = {}

with open(data_in) as fi:
    for line in fi:
        x = line.split(',')[1]
        if x not in files_dict:
            files_dict[x] = open(x + '.csv', 'w')
        files_dict[x].write(line)

for file in files_dict.values():
    file.close()
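
A variation on the loop above, in case an exception is raised while the files are still open: contextlib.ExitStack from the standard library can own all the output files and close them when the with block exits. This is only a sketch; the function name split_by_column and the out_dir and col parameters are assumptions made for illustration, not part of the answers here.

```python
import os
from contextlib import ExitStack


def split_by_column(data_in, out_dir=".", col=1):
    """Write each line of data_in to <key>.csv, where key is column `col`."""
    files = {}  # key -> open file object
    # ExitStack closes every file registered with it, even on an exception.
    with ExitStack() as stack, open(data_in) as fi:
        for line in fi:
            key = line.rstrip("\n").split(",")[col]
            if key not in files:
                path = os.path.join(out_dir, key + ".csv")
                files[key] = stack.enter_context(open(path, "w"))
            files[key].write(line)
    # Hypothetical convenience: report which keys (files) were produced.
    return sorted(files)
```

With the toy data from the question, calling split_by_column(data_in) should produce 431.csv and rr1.csv in the current directory.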

Answer 3

You can also use pandas for large CSV files, since it handles them well, and then simply iterate over the pandas column:

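import pandas as pd

# header=None: the data has no header row, so columns are numbered 0, 1, 2
df = pd.read_csv('fun.txt', header=None)

string = "tester string"

for row in df[1]:
    fo = str(row) + '.csv'
    # append mode: the file is reopened and closed again for every row
    with open(fo, 'a') as f:
        f.write(string + '\n')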

The output is 2 files, 431.csv and rr1.csv. Contents of 431.csv:

tester string

Contents of rr1.csv:

tester string
tester string
tester string

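It appends any additional information to the existing file when a name repeats, which I believe is the desired behavior based on your pseudocode. This is a good solution because it opens and closes your files as it loops through the column. That way you never have 50 files open at once, which could cause trouble for your OS.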
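
One possible middle ground between keeping every file open and reopening a file for every row is to cap the number of simultaneously open handles. The sketch below uses only the standard library and writes the actual input lines. It is an illustration rather than anything proposed in the answers: the function name split_with_handle_limit, the out_dir and max_open parameters, and the least-recently-used eviction policy are all assumptions.

```python
import os
from collections import OrderedDict


def split_with_handle_limit(data_in, out_dir=".", max_open=2, col=1):
    """Split data_in by column `col`, keeping at most max_open files open."""
    open_files = OrderedDict()  # key -> file object, least recently used first
    seen = set()                # keys whose file was already created in this run
    try:
        with open(data_in) as fi:
            for line in fi:
                if not line.endswith("\n"):
                    line += "\n"  # keep later appends on separate lines
                key = line.rstrip("\n").split(",")[col]
                fo = open_files.get(key)
                if fo is None:
                    if len(open_files) >= max_open:
                        # Evict and close the least recently used handle.
                        _, oldest = open_files.popitem(last=False)
                        oldest.close()
                    # First time: truncate; after an eviction: append.
                    mode = "a" if key in seen else "w"
                    fo = open(os.path.join(out_dir, key + ".csv"), mode)
                    open_files[key] = fo
                    seen.add(key)
                else:
                    open_files.move_to_end(key)  # mark as recently used
                fo.write(line)
    finally:
        for fo in open_files.values():
            fo.close()
    return sorted(seen)
```

Choosing max_open is a trade-off: a small value means more reopening when keys are interleaved, while a large value approaches the behavior of keeping every file open.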

