pandas 循环遍历 DataFrame 并仅保留指定的列标题。如果指定的标题不在 DataFrame 中，则会出现错误结果

Question

我想在 python 中使用 pandas 循环遍历多个 DataFrame 并仅保留来自指定 keep_col 列表的标题。如果 DataFrame 不包含指定的标题（KeyError：“['str2'] not in index”），代码将导致错误。

以下 pandas 代码创建了 2 个具有不同列标题名称的示例 DataFrame：

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(2,5), columns=('A','B','str1','str2','str3'))
df2 = pd.DataFrame(np.random.randn(2,3), columns=('A','B','str1'))
print df1
print df2

输出数据帧

 A         B         str1      str2      str3
-0.152686  0.189076 -1.079168 -0.823674  1.489668
-1.272144  0.694862  0.036248  0.319550  0.782666

 A         B         str1
 0.310152  1.302962 -0.284632
 1.046044  0.090650  0.861716

下面的代码会导致错误，因为 'str2' 不在 'df2' 中。

如果 'keep_col' 列表字符串不在 DataFrame 标题中，如何修改它以忽略它？

#delete columns
keep_col = ['A','str2'] #need code here to ignore 'str2' when generating 'df2'
new_df1 = df1[keep_col] 
new_df2 = df2[keep_col]

print new_df1
print new_df2

这是期望的输出：

 A          str2    
-0.152686  -0.823674
-1.272144   0.319550

 A       
 0.310152  
 1.046044

这个例子是为了简单起见。我将遍历 100 多个 .csv 文件以仅保留指定的列。

Answer 1

您可以将 filter() 函数与 RegEx 结合使用：

In [79]: mask = r'^(?:A|str2)$'

In [80]: df1.filter(regex=mask)
Out[80]:
          A      str2
0 -1.190226 -0.123637
1 -1.782685  0.219820

In [81]: df2.filter(regex=mask)
Out[81]:
          A
0  0.207736
1 -0.013273

Answer 2

您可以使用列表推导生成 keep_col 中所有列 headers 的列表。

new_df1 = df1[[c for c in df1.columns if c in keep_col]]
new_df2 = df1[[c for c in df2.columns if c in keep_col]]

print new_df1
>>>
          A      str2
0  1.480978  0.369485
1 -0.969107  0.767707

print new_df2
>>>
          A
0  1.480978
1 -0.969107

pandas 循环遍历 DataFrame 并仅保留指定的列标题。如果指定的标题不在 DataFrame 中，则会出现错误结果

pandas to loop through DataFrames and keep only specified column headings. Error results if specified heading is not in DataFrame

python

multiple-columns

heading

pandas