Pandas ExcelFile.parse has NaNs in index when index_col is specified

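I have an Excel file that I am reading into a pandas DataFrame. The header is on row 1 (Python indexing), and there is a blank row between the header and the data. When I specify index_col, the blank row is treated as part of the index and shows up as NaN. What is the best way to avoid this behavior?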

Test file:

idx value

a   1

Without specifying index_col:

print xs.parse(header = 1)
   idx  value
0  NaN    NaN
1    a      1

print xs.parse(header = 1).index
Int64Index([0, 1], dtype='int64')

With index_col specified:

print xs.parse(header = 1, index_col = 0)
     value
idx       
NaN    NaN
a        1

print xs.parse(header = 1, index_col = 0).index
Index([nan, u'a'], dtype='object')

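You can pass skiprows=[1] to skip the blank row. I tested this on a dummy Excel sheet where the header is on row 0, so the blank row is row 1; adjust the row number to match your file. See ExcelFile.parse: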

In [44]:

xs = pd.ExcelFile(r'c:\data\book1.xls')
xs.parse(skiprows=[1])

Out[44]:
   idx  value
0   12    NaN
1    2    NaN
2    1    NaN

Compare with the default call, which keeps the blank row:

In [45]:

xs = pd.ExcelFile(r'c:\data\book1.xls')
xs.parse()

Out[45]:
   idx  value
0  NaN    NaN
1   12    NaN
2    2    NaN
3    1    NaN

In [47]:

xs = pd.ExcelFile(r'c:\data\book1.xls')
xs.parse(skiprows=[1], header=0)

Out[47]:
   idx  value
0   12    NaN
1    2    NaN
2   1    NaN

In [49]:

xs = pd.ExcelFile(r'c:\data\book1.xls')
xs.parse(skiprows=[1], header=0, index_col=0)

Out[49]:
     value
idx       
12     NaN
2      NaN
1      NaN

In [50]:

xs = pd.ExcelFile(r'c:\data\book1.xls')
xs.parse(header=0, index_col=0)

Out[50]:
     value
idx       
NaN    NaN
 12    NaN
 2     NaN
 1     NaN
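
If the position of the blank row is not known in advance, another option is to parse without index_col, drop the rows that are entirely NaN, and only then set the index. This is not something I tested against the Excel file above; it is a minimal sketch. The frame `parsed` is a hand-built stand-in for what xs.parse(header=1) returns in the question, and `drop_blank_rows_then_index` is a hypothetical helper name, not part of pandas.

```python
import numpy as np
import pandas as pd

# Stand-in (assumption) for the result of xs.parse(header=1) in the question:
# the blank spreadsheet row comes through as a row where every cell is NaN.
parsed = pd.DataFrame({"idx": [np.nan, "a"], "value": [np.nan, 1.0]})


def drop_blank_rows_then_index(df, index_name):
    """Drop rows where every cell is NaN, then use `index_name` as the index.

    Hypothetical helper for illustration; not a pandas function.
    """
    # how="all" removes a row only if all of its cells are missing,
    # so data rows with a few missing cells are kept.
    return df.dropna(how="all").set_index(index_name)


cleaned = drop_blank_rows_then_index(parsed, "idx")
print(cleaned)
print(cleaned.index)
```

The trade-off is that this removes every fully empty row, wherever it sits in the sheet, whereas skiprows removes only the rows you name.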