How to efficiently combine multiple pandas columns into one array-like column?
Creating (or loading) a DataFrame with an object-type column is easy, like so:
[In]: pdf = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "c": [7, 8, 9],
    "combined": [[1, 4, 7], [2, 5, 8], [3, 6, 9]]}
)
[Out]
   a  b  c   combined
0  1  4  7  [1, 4, 7]
1  2  5  8  [2, 5, 8]
2  3  6  9  [3, 6, 9]
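Where I currently stand is that I have the values as separate columns, I need to return them as a single column, and I need to do so very efficiently. Is there a fast, efficient way to combine the columns into one object-type column?
In the example above, this means I already have columns a, b, and c, and I want to create combined.
I could not find a similar question online; if this is a duplicate, feel free to link it.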
I'm not sure whether it is fast enough, but you can use pandas.DataFrame.apply with axis=1 (i.e. apply the function to each row) combined with pandas.Series.tolist, as follows:
import pandas as pd
df = pd.DataFrame({"a":[1,2,3],"b":[4,5,6],"c":[7,8,9]})
df["combined"] = df.apply(pd.Series.tolist,axis=1)
print(df)
Output
   a  b  c   combined
0  1  4  7  [1, 4, 7]
1  2  5  8  [2, 5, 8]
2  3  6  9  [3, 6, 9]
import pandas as pd
df = pd.DataFrame({"a":[1,2,3],"b":[4,5,6],"c":[7,8,9]})
df['combined'] = [list(row) for _, row in df.iterrows()]
Output
   a  b  c   combined
0  1  4  7  [1, 4, 7]
1  2  5  8  [2, 5, 8]
2  3  6  9  [3, 6, 9]
Use DataFrame.agg with list as the aggregation function and axis=1, then assign the result to a new column:
>>> pdf.assign(combined=pdf.agg(list, axis=1))
   a  b  c   combined
0  1  4  7  [1, 4, 7]
1  2  5  8  [2, 5, 8]
2  3  6  9  [3, 6, 9]
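One detail worth spelling out about the agg answer: assign returns a new DataFrame and leaves the original untouched, and agg works on every column it is given. The following is a minimal sketch, assuming a frame that holds only a, b, and c; the names cols and result are just illustrative. It selects the source columns explicitly so that any other column in the frame stays out of the lists.

```python
import pandas as pd

pdf = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# Pick the source columns explicitly so unrelated columns are not swept in.
cols = ["a", "b", "c"]

# assign() returns a new DataFrame; pdf itself is not modified.
result = pdf.assign(combined=pdf[cols].agg(list, axis=1))

print(result)
print("combined" in pdf.columns)  # the original frame has no new column
```

If you want the column on the original frame, rebind the name (pdf = pdf.assign(...)) or use pdf["combined"] = ... instead.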
A simple solution is to use pandas.DataFrame.apply on the columns you need to combine, like this:
cols = ['a', 'b', 'c']
df['combined'] = df[cols].apply(lambda row: list(row.values), axis=1)
Output:
   a  b  c   combined
0  1  4  7  [1, 4, 7]
1  2  5  8  [2, 5, 8]
2  3  6  9  [3, 6, 9]
After that, you can use the pandarallel library (https://github.com/nalepae/pandarallel) to run the apply in parallel:
from pandarallel import pandarallel
pandarallel.initialize()
cols = ['a', 'b', 'c']
df['combined'] = df[cols].parallel_apply(lambda row: list(row.values), axis=1)
This should prove to be the fastest approach for large amounts of data.
On large data, using numpy is much faster than the rest.
Update: numpy with a list comprehension is even faster, taking only 0.77 seconds:
pdf['combined'] = [x for x in pdf[['a', 'b', 'c']].to_numpy()]
# pdf['combined'] = pdf[['a', 'b', 'c']].to_numpy().tolist()
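The two numpy variants above are not quite interchangeable, which may matter depending on what you do with the column afterwards. A small sketch of the difference (the column names combined_arrays and combined_lists are made up for illustration): the list comprehension stores one numpy array per row, while .tolist() stores plain Python lists.

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
values = pdf[["a", "b", "c"]].to_numpy()  # 2-D array, one row per DataFrame row

# Variant 1: iterating the 2-D array yields one 1-D ndarray per row.
pdf["combined_arrays"] = [row for row in values]

# Variant 2: tolist() converts everything to nested Python lists.
pdf["combined_lists"] = values.tolist()

print(type(pdf["combined_arrays"].iloc[0]))  # numpy.ndarray
print(type(pdf["combined_lists"].iloc[0]))   # list
```

Both columns end up with object dtype; pick the variant whose element type suits the code that consumes the column.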
Speed comparison
import pandas as pd
import sys
import time

def f1():
    pdf = pd.DataFrame({"a": [1, 2, 3]*1000000, "b": [4, 5, 6]*1000000, "c": [7, 8, 9]*1000000})
    s0 = time.time()
    pdf.assign(combined=pdf.agg(list, axis=1))
    print(time.time() - s0)

def f2():
    pdf = pd.DataFrame({"a": [1, 2, 3]*1000000, "b": [4, 5, 6]*1000000, "c": [7, 8, 9]*1000000})
    s0 = time.time()
    pdf['combined'] = [x for x in pdf[['a', 'b', 'c']].to_numpy()]
    # pdf['combined'] = pdf[['a', 'b', 'c']].to_numpy().tolist()
    print(time.time() - s0)

def f3():
    pdf = pd.DataFrame({"a": [1, 2, 3]*1000000, "b": [4, 5, 6]*1000000, "c": [7, 8, 9]*1000000})
    s0 = time.time()
    cols = ['a', 'b', 'c']
    pdf['combined'] = pdf[cols].apply(lambda row: list(row.values), axis=1)
    print(time.time() - s0)

def f4():
    pdf = pd.DataFrame({"a": [1, 2, 3]*1000000, "b": [4, 5, 6]*1000000, "c": [7, 8, 9]*1000000})
    s0 = time.time()
    pdf["combined"] = pdf.apply(pd.Series.tolist, axis=1)
    print(time.time() - s0)

if __name__ == '__main__':
    # Run the function named on the command line, e.g. `python test.py f2`
    eval(f'{sys.argv[1]}()')
➜ python test.py f1
17.766116857528687
➜ python test.py f2
0.7762737274169922
➜ python test.py f3
14.403311252593994
➜ python test.py f4
12.631694078445435
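If you want to rerun the comparison on your own machine, something along these lines may be handier than four separate invocations. It is only a sketch: the helper names (make_frame, via_agg, and so on) and the small frame size are my own choices, not part of the benchmark above, and the absolute timings will differ by machine and pandas version. It first checks that all four approaches build the same lists, then times each one with timeit.

```python
import timeit

import pandas as pd

COLS = ["a", "b", "c"]


def make_frame(n_repeats):
    """Build the same three-column frame as the benchmark, repeated n_repeats times."""
    return pd.DataFrame({
        "a": [1, 2, 3] * n_repeats,
        "b": [4, 5, 6] * n_repeats,
        "c": [7, 8, 9] * n_repeats,
    })


def via_agg(df):
    return df[COLS].agg(list, axis=1)


def via_numpy_tolist(df):
    return pd.Series(df[COLS].to_numpy().tolist(), index=df.index)


def via_apply_lambda(df):
    return df[COLS].apply(lambda row: list(row.values), axis=1)


def via_apply_tolist(df):
    return df[COLS].apply(pd.Series.tolist, axis=1)


METHODS = {
    "agg": via_agg,
    "numpy_tolist": via_numpy_tolist,
    "apply_lambda": via_apply_lambda,
    "apply_tolist": via_apply_tolist,
}

n_repeats = 1000  # kept small so the script finishes quickly; raise it for a real test
df = make_frame(n_repeats)
expected = [[1, 4, 7], [2, 5, 8], [3, 6, 9]] * n_repeats

# Check that every approach produces the same row lists before timing anything.
results = {name: fn(df).tolist() for name, fn in METHODS.items()}
all_match = all(rows == expected for rows in results.values())
print("all approaches agree:", all_match)

# Take the best of three single runs for each approach.
timings = {
    name: min(timeit.repeat(lambda fn=fn: fn(df), number=1, repeat=3))
    for name, fn in METHODS.items()
}
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds:.4f} s")
```

Taking the minimum over a few repeats reduces noise from other processes compared with a single time.time() measurement.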