Why does concatenation of DataFrames get exponentially slower?

Question

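I have a function that processes a DataFrame, largely to sort data into buckets: it creates a binary matrix of features from a particular column using pd.get_dummies(df[col]).

To avoid processing all of my data with this function at once (which runs out of memory and crashes IPython), I split the large DataFrame into chunks using: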

chunks = len(df) // 10000 + 1  # integer division, so array_split gets an int
df_list = np.array_split(df, chunks)

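pd.get_dummies(df) automatically creates new columns based on the contents of df[col], and these are likely to differ for each df in df_list.

After processing, I concatenate the DataFrames back together using: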
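As a minimal sketch of what "columns may differ per chunk" means in practice, the snippet below builds two tiny chunks by hand. The column name `color` and the data are made up for illustration, and `dtype=int` is passed only to keep the output dtype predictable across pandas versions. With its default outer join, pd.concat takes the union of the columns and fills the gaps with NaN, which is why the fill step is included here.

```python
import pandas as pd

# Hypothetical chunks: each one sees a different set of categories.
chunk_a = pd.DataFrame({"color": ["red", "blue"]})
chunk_b = pd.DataFrame({"color": ["green", "red"]})

# Each call only creates columns for the categories present in that chunk.
dummies_a = pd.get_dummies(chunk_a["color"], dtype=int)  # columns: blue, red
dummies_b = pd.get_dummies(chunk_b["color"], dtype=int)  # columns: green, red

# The default join="outer" keeps the union of columns; missing cells become NaN.
combined = pd.concat([dummies_a, dummies_b], axis=0, ignore_index=True)

# Treat "category never seen in this chunk" as 0 and restore integer dtype.
combined = combined.fillna(0).astype(int)
print(combined)
```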

for i, df_chunk in enumerate(df_list):
    print("chunk", i)
    [x, y] = preprocess_data(df_chunk)
    super_x = pd.concat([super_x, x], axis=0)
    super_y = pd.concat([super_y, y], axis=0)
    print(datetime.datetime.utcnow())

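The processing time for the first chunk is perfectly acceptable, but it grows with every chunk! This is not caused by preprocess_data(df_chunk), since there is no reason for that step to slow down. Is the increase in time caused by the call to pd.concat()?

Please see the log below: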

chunks 6
chunk 0
2016-04-08 00:22:17.728849
chunk 1
2016-04-08 00:22:42.387693 
chunk 2
2016-04-08 00:23:43.124381
chunk 3
2016-04-08 00:25:30.249369
chunk 4
2016-04-08 00:28:11.922305
chunk 5
2016-04-08 00:32:00.357365

Is there a workaround to speed this up? I have 2900 chunks to process, so any help is appreciated!

I am open to any other suggestions in Python!

Answer 1

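Each time you concatenate, a copy of the data is returned.

You want to keep a list of your chunks and then concatenate everything as the final step.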

df_x = []
df_y = []
for i, df_chunk in enumerate(df_list):
    print("chunk", i)
    [x, y] = preprocess_data(df_chunk)
    df_x.append(x)  # only stores a reference, no data is copied
    df_y.append(y)

super_x = pd.concat(df_x, axis=0)
del df_x  # Free up memory.
super_y = pd.concat(df_y, axis=0)
del df_y  # Free up memory.

Answer 2

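Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.

pd.concat returns a new DataFrame. Space has to be allocated for the new DataFrame, and data from the old DataFrames has to be copied into it. Consider the amount of copying required by this line inside the for-loop (assuming each x has size 1):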

super_x = pd.concat([super_x, x], axis=0)

| iteration | size of old super_x | size of x | copying required |
|         0 |                   0 |         1 |                1 |
|         1 |                   1 |         1 |                2 |
|         2 |                   2 |         1 |                3 |
|       ... |                     |           |                  |
|       N-1 |                 N-1 |         1 |                N |

1 + 2 + 3 + ... + N = N(N+1)/2. So O(N**2) copies are required to complete the loop.

Now consider
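To make the arithmetic concrete, here is a small sketch that tallies the copied rows for the 2900 chunks mentioned in the question, under the same simplifying assumption that each x has size 1. The function name is made up for illustration; it only models the bookkeeping and does not call pandas.

```python
def rows_copied_concat_in_loop(n_chunks, chunk_size=1):
    """Count rows copied when pd.concat([super_x, x]) runs on every iteration."""
    copied = 0
    accumulated = 0  # current size of super_x
    for _ in range(n_chunks):
        accumulated += chunk_size  # size of the newly built super_x
        copied += accumulated      # old rows and the new chunk are both copied
    return copied


n = 2900
total = rows_copied_concat_in_loop(n)
closed_form = n * (n + 1) // 2

print(total)        # 4206450
print(closed_form)  # 4206450
```

Roughly 4.2 million row copies are needed to assemble only 2900 rows, which matches the steadily growing gaps between timestamps in the question's log.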

super_x = []
for i, df_chunk in enumerate(df_list):
    [x, y] = preprocess_data(df_chunk)
    super_x.append(x)
super_x = pd.concat(super_x, axis=0)

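Appending to a list is an O(1) operation and does not require copying. Now there is a single call to pd.concat after the loop is done. This call to pd.concat requires N copies to be made, since super_x contains N DataFrames of size 1. So when constructed this way, super_x requires O(N) copies.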
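The difference can be observed with a rough timing sketch like the one below. The chunk count, chunk shape, and function names are arbitrary choices for illustration, and the absolute timings will vary by machine and pandas version; the point is only that both patterns produce the same result while the in-loop version does far more copying.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical stand-in data: 300 small chunks of 50 rows and 20 columns.
rng = np.random.default_rng(0)
chunks = [pd.DataFrame(rng.random((50, 20))) for _ in range(300)]


def concat_in_loop(frames):
    """Quadratic pattern: the growing result is copied again on every iteration."""
    result = frames[0]
    for frame in frames[1:]:
        result = pd.concat([result, frame], axis=0)
    return result


def concat_once(frames):
    """Linear pattern: collect the pieces first, then copy each row once."""
    return pd.concat(list(frames), axis=0)


start = time.perf_counter()
slow = concat_in_loop(chunks)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = concat_once(chunks)
once_seconds = time.perf_counter() - start

print(f"concat inside loop: {loop_seconds:.3f} s")
print(f"single concat:      {once_seconds:.3f} s")
```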

python

performance

concatenation

processing-efficiency

pandas