How to efficiently combine multiple pandas columns into one array-like column?

It is easy to create (or load) a DataFrame that has an object-type column holding lists, like so:

[In]: pdf = pd.DataFrame({
                     "a": [1, 2, 3], 
                     "b": [4, 5, 6], 
                     "c": [7, 8, 9], 
                     "combined": [[1, 4, 7], [2, 5, 8], [3, 6, 9]]}
      )

[Out]
   a  b  c   combined
0  1  4  7  [1, 4, 7]
1  2  5  8  [2, 5, 8]
2  3  6  9  [3, 6, 9]

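I am currently in the position where I have the values as separate columns and need to return them as a single column, and I need to do so quite efficiently. Is there a fast and efficient way to combine columns into a single object-type column?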

In the example above, this would mean already having the columns a, b, and c, and wanting to create combined.

I could not find a similar example of this question online; if this is a duplicate, feel free to link it.

I am not sure whether it is fast enough, but you can use pandas.DataFrame.apply with axis=1 (i.e. apply the function to each row) combined with pandas.Series.tolist, as follows:

import pandas as pd
df = pd.DataFrame({"a":[1,2,3],"b":[4,5,6],"c":[7,8,9]})
df["combined"] = df.apply(pd.Series.tolist,axis=1)
print(df)

Output

   a  b  c   combined
0  1  4  7  [1, 4, 7]
1  2  5  8  [2, 5, 8]
2  3  6  9  [3, 6, 9]

Another option is a list comprehension over DataFrame.iterrows:

import pandas as pd
df = pd.DataFrame({"a":[1,2,3],"b":[4,5,6],"c":[7,8,9]})
df['combined'] = [list(row) for _, row in df.iterrows()]
print(df)

Output

   a  b  c   combined
0  1  4  7  [1, 4, 7]
1  2  5  8  [2, 5, 8]
2  3  6  9  [3, 6, 9]
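
As a side note to the iterrows version above: iterrows builds a Series object for every row, which tends to be slow on large frames. The sketch below swaps in DataFrame.itertuples, which is not part of the original answer; the pandas documentation describes it as generally faster than iterrows. The variable name cols is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
cols = ["a", "b", "c"]  # illustrative name for the columns to combine

# index=False drops the index from each tuple; name=None yields plain tuples
df["combined"] = [list(row) for row in df[cols].itertuples(index=False, name=None)]
print(df)
```

Selecting the columns before iterating keeps any other columns in the frame out of the combined lists.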

Use DataFrame.agg, passing list as the aggregation function with axis=1, then assign the result to a new column:

>>> pdf.assign(combined=pdf.agg(list, axis=1))

   a  b  c   combined
0  1  4  7  [1, 4, 7]
1  2  5  8  [2, 5, 8]
2  3  6  9  [3, 6, 9]
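
One caveat with the agg call above: it aggregates every column in the frame. If pdf is the frame from the question, which already has a combined column, each row's list would also pick up that existing list. A minimal sketch that selects the source columns first (the name cols is an assumption for illustration):

```python
import pandas as pd

pdf = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "c": [7, 8, 9],
    "combined": [[1, 4, 7], [2, 5, 8], [3, 6, 9]],
})
cols = ["a", "b", "c"]  # illustrative: only these columns are aggregated

# Restricting to cols keeps the pre-existing "combined" column out of each list
result = pdf.assign(combined=pdf[cols].agg(list, axis=1))
print(result)
```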

A simple solution is to use pandas.DataFrame.apply on the columns that need to be combined. Something like this:

cols = ['a', 'b', 'c']
df['combined'] = df[cols].apply(lambda row: list(row.values), axis=1)

Output:

    a   b   c   combined
0   1   4   7   [1, 4, 7]
1   2   5   8   [2, 5, 8]
2   3   6   9   [3, 6, 9]

After that, you can use the pandarallel library (https://github.com/nalepae/pandarallel) to run the apply in parallel:

from pandarallel import pandarallel
pandarallel.initialize()

cols = ['a', 'b', 'c']
df['combined'] = df[cols].parallel_apply(lambda row: list(row.values), axis=1)

This should prove to be the fastest way to handle large amounts of data.

On large data, using numpy is much faster than the other approaches.

Update: numpy with a list comprehension is faster, taking only 0.77 seconds

pdf['combined'] = [x for x in pdf[['a', 'b', 'c']].to_numpy()]
# pdf['combined'] = pdf[['a', 'b', 'c']].to_numpy().tolist()
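
The two numpy variants above do not produce the same element type, which may matter depending on what consumes the column. A small sketch of the difference (the names values, as_arrays, and as_lists are illustrative only):

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
values = pdf[["a", "b", "c"]].to_numpy()

# Iterating over a 2-D array yields one 1-D ndarray per row
as_arrays = [row for row in values]
# tolist() converts the whole array into nested Python lists in one call
as_lists = values.tolist()

print(type(as_arrays[0]))  # a numpy.ndarray per row
print(type(as_lists[0]))   # a plain Python list per row

# The tolist() variant stores real lists, matching the question's example
pdf["combined"] = as_lists
print(pdf)
```

If the goal is a column of Python lists exactly as in the question, the tolist() variant is the closer match; the list-comprehension variant stores numpy arrays that merely print similarly.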

Speed comparison

import pandas as pd
import sys
import time

def f1():
    pdf = pd.DataFrame({"a": [1, 2, 3]*1000000,  "b": [4, 5, 6]*1000000,  "c": [7, 8, 9]*1000000})
    s0 = time.time()
    pdf.assign(combined=pdf.agg(list, axis=1))
    print(time.time() - s0)

def f2():
    pdf = pd.DataFrame({"a": [1, 2, 3]*1000000,  "b": [4, 5, 6]*1000000,  "c": [7, 8, 9]*1000000})
    s0 = time.time()
    pdf['combined'] = [x for x in pdf[['a', 'b', 'c']].to_numpy()]
    # pdf['combined'] = pdf[['a', 'b', 'c']].to_numpy().tolist()
    print(time.time() - s0)

def f3():
    pdf = pd.DataFrame({"a": [1, 2, 3]*1000000,  "b": [4, 5, 6]*1000000,  "c": [7, 8, 9]*1000000})
    s0 = time.time()
    cols = ['a', 'b', 'c']
    pdf['combined'] = pdf[cols].apply(lambda row: list(row.values), axis=1)
    print(time.time() - s0)

def f4():
    pdf = pd.DataFrame({"a": [1, 2, 3]*1000000,  "b": [4, 5, 6]*1000000,  "c": [7, 8, 9]*1000000})
    s0 = time.time()
    pdf["combined"] = pdf.apply(pd.Series.tolist,axis=1)
    print(time.time() - s0)

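if __name__ == '__main__':
    # Look up the benchmark by name instead of calling eval on a command-line argument
    {'f1': f1, 'f2': f2, 'f3': f3, 'f4': f4}[sys.argv[1]]()

Results (in seconds):

➜   python test.py f1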
17.766116857528687
➜   python test.py f2
0.7762737274169922
➜   python test.py f3
14.403311252593994
➜   python test.py f4
12.631694078445435