More efficient way than zipping arrays for transposing a table in Python?

Question

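I have been trying to transpose my table of 2,000,000+ rows and 300+ columns on a cluster, but it seems my Python script gets killed for running out of memory. I would just like to know whether anyone has suggestions for a more efficient way to store my table data than using an array, as in my code below?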

import sys
Seperator = "\t"
m = []
f = open(sys.argv[1], 'r')
data = f.read()
lines = data.split("\n")[:-1]
for line in lines:
    m.append(line.strip().split("\t"))
for i in zip(*m):
    for j in range(len(i)):
        if j != len(i):
            print(i[j] +Seperator)
        else:
            print(i[j])
    print ("\n")

Thanks a lot.

Answer 1

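Since the number of columns is far smaller than the number of rows, I would consider writing each column to a separate file and then combining them.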

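import sys
Separator = "\t"

with open(sys.argv[1], 'r') as f:
    # count the columns from the first line, then rewind
    column_count = len(f.readline().rstrip("\n").split("\t"))
    f.seek(0)
    # open 300+ file handles, one for each column (these file names are placeholders)
    column_file = [open("column-{}.txt".format(i), 'w') for i in range(column_count)]
    for line in f:
        for i, c in enumerate(line.rstrip("\n").split("\t")):
            dest = column_file[i]
            dest.write(c)
            dest.write(Separator)

for dest in column_file:
    dest.close()

# all you need to do after that is combine the contents of your "row" files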
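
The combine step mentioned in the last comment could look roughly like the sketch below. The function name `combine_row_files` and its parameters are hypothetical, and it assumes each per-column file holds one transposed row ending in a trailing separator, as written by the loop above.

```python
def combine_row_files(row_filenames, output_filename, separator='\t'):
    """Concatenate per-column files into one file, one output line per input file."""
    with open(output_filename, 'w') as output:
        for name in row_filenames:
            with open(name) as row_file:
                row = row_file.read()
            # Drop the trailing separator left behind by the writing loop, if any.
            if row.endswith(separator):
                row = row[:-len(separator)]
            output.write(row + '\n')
```

Only one transposed row is held in memory at a time, which for 2,000,000 short cells is on the order of tens of megabytes.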

Answer 2

If you cannot fit the whole file in memory, you can read it n times:

import sys

column_number = 4  # if necessary, read the first line of the file to calculate it
separator = '\t'
filename = sys.argv[1]

def get_nth_column(filename, n):
    with open(filename, 'r') as file:
        for line in file:
            if line.strip():  # skip empty lines
                yield line.strip().split(separator)[n]


for column in range(column_number):
    print(separator.join(get_nth_column(filename, column)))

Note that an exception will be raised if the file is not well formed. You can catch it if you need to.
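
As a rough illustration of catching that exception, the variant below reports which line was too short. The name `get_nth_column_checked` and the choice to re-raise as `ValueError` are assumptions made for this sketch, not part of the answer above.

```python
def get_nth_column_checked(filename, n, separator='\t'):
    """Yield the n-th field of every non-empty line, failing with a clear message."""
    with open(filename, 'r') as file:
        for line_number, line in enumerate(file, start=1):
            if not line.strip():
                continue  # skip blank lines
            fields = line.rstrip('\n').split(separator)
            try:
                yield fields[n]
            except IndexError:
                # The line has fewer than n + 1 fields: say where the problem is.
                raise ValueError(
                    'line {} has {} columns, expected at least {}'.format(
                        line_number, len(fields), n + 1))
```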

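When reading a file, use the with construct to make sure the file gets closed, and iterate directly over the file instead of reading its contents first. It is more readable and more efficient.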

Answer 3

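The first thing to note is that you have been careless with your variables. You load a large file into memory as a single string, then as a list of strings, then as a list of lists of strings, and only then transpose that list. This means you store all the data in the file three times before you even begin to transpose it.

If each string in the file is only about 10 characters long, then you need 18GB of memory just to store it (2e6 rows * 300 columns * 10 bytes * 3 duplicates). That is before you factor in the overhead of Python objects (~27 bytes per string object).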
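
The arithmetic above can be reproduced with a few lines. The row and column counts come from the question, the 10-byte cell length is the same assumption as in the estimate, and the per-object overhead is measured on whatever interpreter runs the snippet rather than assumed, since it varies between Python versions.

```python
import sys

rows = 2000000      # rows in the table (from the question)
cols = 300          # columns in the table (from the question)
cell_bytes = 10     # assumed average cell length
copies = 3          # raw string, list of lines, list of lists of cells

raw_bytes = rows * cols * cell_bytes * copies
print('raw character data: {:.0f} GB'.format(raw_bytes / 1e9))  # 18 GB

# Per-object overhead is implementation-specific, so measure it here.
overhead = sys.getsizeof('')
print('empty str object on this interpreter: {} bytes'.format(overhead))
print('overhead for one str per cell: {:.1f} GB'.format(rows * cols * overhead / 1e9))
```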

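You have a couple of options here.

Create each new transposed row incrementally, by making one pass over the file for each new row (that is, once per original column) and writing out each new row as it is completed (sacrifices time efficiency).
Create one file per new row and combine these row files at the end (sacrifices disk space efficiency, and may be a problem if the initial file has many columns, because of the limit on the number of open files a process may have).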
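
A minimal sketch of the first option, generalized so that each pass over the file extracts a batch of columns instead of a single one. The function name `transpose_in_batches` and the `batch_size` parameter are hypothetical; with `batch_size=1` it reads the file once per column, and larger values trade memory for fewer passes.

```python
def transpose_in_batches(input_filename, output_filename, delimiter='\t', batch_size=10):
    """Transpose a delimited file, reading it once per batch of columns."""
    # Count the columns from the first line.
    with open(input_filename) as source:
        n_cols = source.readline().rstrip('\n').count(delimiter) + 1

    with open(output_filename, 'w') as target:
        for start in range(0, n_cols, batch_size):
            stop = min(start + batch_size, n_cols)
            # One list per column in this batch; only these are held in memory.
            batch = [[] for _ in range(start, stop)]
            with open(input_filename) as source:
                for line in source:
                    if not line.strip():
                        continue  # skip blank lines
                    parts = line.rstrip('\n').split(delimiter)
                    for column, cell in zip(batch, parts[start:stop]):
                        column.append(cell)
            # Each collected column becomes one row of the output.
            for column in batch:
                target.write(delimiter.join(column) + '\n')
```

Memory use grows with `batch_size` times the number of rows, so the batch size would need tuning to the memory actually available on the cluster.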

Transposing with a limited number of open files

delimiter = ','

input_filename = 'file.csv'
output_filename = 'out.csv'

# find out the number of columns in the file
with open(input_filename) as source:
    old_cols = source.readline().count(delimiter) + 1

temp_files = [
    'temp-file-{}.csv'.format(i)
    for i in range(old_cols)
]

# create (or empty) the temp files
for temp_filename in temp_files:
    with open(temp_filename, 'w') as output:
        output.truncate()

# append each cell to the temp file for its column; only one temp file is open at a time
with open(input_filename) as source:
    for line in source:
        parts = line.rstrip('\n').split(delimiter)
        assert len(parts) == len(temp_files), 'not enough or too many columns'
        for temp_filename, cell in zip(temp_files, parts):
            with open(temp_filename, 'a') as output:
                output.write(cell)
                output.write(delimiter)

# combine temp files: each one becomes a single row of the output
with open(output_filename, 'w') as output:
    for temp_filename in temp_files:
        with open(temp_filename) as temp_file:
            # drop the trailing delimiter and end the row
            line = temp_file.read()[:-len(delimiter)] + '\n'
            output.write(line)
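
Before running a script like this on the full table, it may be worth checking it on a small sample. The helper below is a sketch with hypothetical names (`read_table`, `is_transpose`) that compares the output against the in-memory `zip(*rows)` transpose, so it is only suitable for files that fit comfortably in memory.

```python
def read_table(filename, delimiter=','):
    """Read a small delimited file into a list of rows, skipping blank lines."""
    with open(filename) as f:
        return [line.rstrip('\n').split(delimiter) for line in f if line.strip()]


def is_transpose(original_filename, transposed_filename, delimiter=','):
    """Return True if the second file is the transpose of the first."""
    rows = read_table(original_filename, delimiter)
    expected = [list(column) for column in zip(*rows)]
    return expected == read_table(transposed_filename, delimiter)
```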
