Efficient column indexing and selection in pandas

Question

I am looking for the most efficient way to select multiple columns from a DataFrame:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(4,8), columns = list('abcdefgh'))

I want to select only columns a, c, e, f and g, which can be done with integer indexing:

df.ix[:,[0,2,4,5,6]]

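For a large DataFrame with many columns this seems inefficient. Where possible I would rather specify consecutive column indices as a range, but both of the attempts below raise a syntax error: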

df.ix[:,[0,2,4:6]]

or

df.ix[:,[0,2,[4:6]]]

Answer 1

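I think you can use range (wrapped in list() on Python 3, where range no longer returns a list):

print([0, 2] + list(range(4, 7)))
[0, 2, 4, 5, 6]


# .ix was removed in pandas 1.0; .iloc is the positional equivalent
print(df.iloc[:, [0, 2] + list(range(4, 7))])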
          a         c         e         f         g
0  0.278231  0.192650  0.653491  0.944689  0.663457
1  0.416367  0.477074  0.582187  0.730247  0.946496
2  0.396906  0.877941  0.774960  0.057290  0.556719
3  0.119685  0.211581  0.526096  0.213282  0.492261
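
If concatenating lists and ranges feels clumsy, NumPy's `np.r_` index helper accepts slice syntax directly and builds the same array of positions. A minimal sketch, assuming a pandas version where `.iloc` is the positional indexer (`.ix` was removed in pandas 1.0); the names `cols` and `subset` are only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 8), columns=list('abcdefgh'))

# np.r_ expands slices and concatenates them with the scalar positions.
# As with range, the slice stop is exclusive, so 4:7 yields 4, 5, 6.
cols = np.r_[0, 2, 4:7]

# Positional selection with the resulting integer array.
subset = df.iloc[:, cols]
print(list(subset.columns))  # ['a', 'c', 'e', 'f', 'g']
```

This keeps the slice notation the question was after, at the cost of building a small index array first.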

Answer 2

pandas is fairly well thought out; the shortest way is also the most efficient:

df[['a','c','e','f','g']]

You don't need ix here, since it performs a lookup in your data, but for this approach you obviously need the column names.
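
When the names are known, a contiguous run of columns can also be taken with a label slice rather than spelled out one by one. A sketch, assuming the column order from the question; note that label slices in `.loc` include both endpoints, unlike positional slices. The names `names` and `by_name` are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 8), columns=list('abcdefgh'))

# Label slices are inclusive on both ends: 'e':'g' covers e, f and g.
names = ['a', 'c'] + list(df.loc[:, 'e':'g'].columns)

# Plain name-based selection with the combined list.
by_name = df[names]
print(names)  # ['a', 'c', 'e', 'f', 'g']
```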

Answer 3

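As soon as you select non-adjacent columns, you pay an overhead.
If your data is homogeneous (a single dtype), falling back to NumPy gives a significant improvement.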

In [147]: %timeit df[['a','c','e','f','g']]
          %timeit df.values[:,[0,2,4,5,6]]
          %timeit df.ix[:,[0,2,4,5,6]]
          %timeit pd.DataFrame(df.values[:,[0,2,4,5,6]],columns=df.columns[[0,2,4,5,6]])
100 loops, best of 3: 2.67 ms per loop
10000 loops, best of 3: 58.7 µs per loop
1000 loops, best of 3: 1.81 ms per loop 
1000 loops, best of 3: 568 µs per loop
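
These measurements come from an older pandas release that still had `.ix`, so the absolute numbers will differ on current versions. A sketch for repeating the comparison with the standard `timeit` module, assuming `.iloc` as the positional replacement for `.ix`, `to_numpy()` in place of `.values`, and all-float data; the `candidates` dictionary and its labels are illustrative, and no timings are claimed here:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 8), columns=list('abcdefgh'))
pos = [0, 2, 4, 5, 6]

# Each entry is one way of selecting the same five columns.
candidates = {
    'by name': lambda: df[['a', 'c', 'e', 'f', 'g']],
    'numpy only': lambda: df.to_numpy()[:, pos],
    'iloc': lambda: df.iloc[:, pos],
    'numpy + rebuild': lambda: pd.DataFrame(
        df.to_numpy()[:, pos], columns=df.columns[pos]
    ),
}

for label, fn in candidates.items():
    # Total seconds for 1000 calls, reported as microseconds per call.
    seconds = timeit.timeit(fn, number=1000)
    print(f'{label}: {seconds / 1000 * 1e6:.1f} µs per call')
```

The "numpy only" variant returns a bare array without column labels, so it is only comparable when the labels are not needed afterwards.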

python

multiple-columns

pandas