Issues with working with multi-index data frames

Question

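I have a huge data frame. Here I have tried to build a small multi-index data frame that resembles it. I need to get the number of NaNs for each index value and each column.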

import numpy as np
import pandas as pd

temp = pd.DataFrame({'tic': ['IBM', 'AAPL', 'AAPL', 'IBM', 'AAPL'],
                     'industry': ['A', 'B', 'B', 'A', 'B'],
                     'price': [np.nan, 5, 6, 11, np.nan],
                     'shares': [100, 60, np.nan, 100, 62],
                     'dates': pd.to_datetime(['1990-01-01', '1990-01-01', '1990-04-01',
                                              '1990-04-01', '1990-08-01'])
                     })

temp.set_index(['tic', 'dates'], inplace=True)

This produces:

                industry  price  shares
tic  dates                             
IBM  1990-01-01        A    NaN   100.0
AAPL 1990-01-01        B    5.0    60.0
     1990-04-01        B    6.0     NaN
IBM  1990-04-01        A   11.0   100.0
AAPL 1990-08-01        B    NaN    62.0

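My questions are:

1) Minor question: why is the index not working? I expected to see a single IBM and a single AAPL in the tic column.

2) How do I get, for each tic and each column, the ratio of NaNs to total data points? I need a data frame like this: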

tic                                     IBM               AAPL
number of total NaNs                    1                 2
percentage of NaNs in 'price' column    50% (1 out of 2)  33.3% (1 out of 3)
percentage of NaNs in 'shares' column   0% (0 out of 2)   33.3% (1 out of 3)

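3) How do I rank the tics by the ratio of NaNs in the price column?

4) How do I select the top n tics with the lowest ratio of NaNs across both columns?

5) How do I do the above between two dates?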

Answer 1

1) Why is the index not working?

temp.sort_index()
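
The index itself is fine; it is just not sorted, so equal tic labels are not adjacent and pandas cannot collapse them in the display. Note that sort_index returns a new frame rather than modifying temp. A minimal sketch of that behaviour, rebuilding the question's frame (the names sorted_temp and tic_order are only illustrative):

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({
    'tic': ['IBM', 'AAPL', 'AAPL', 'IBM', 'AAPL'],
    'industry': ['A', 'B', 'B', 'A', 'B'],
    'price': [np.nan, 5, 6, 11, np.nan],
    'shares': [100, 60, np.nan, 100, 62],
    'dates': pd.to_datetime(['1990-01-01', '1990-01-01', '1990-04-01',
                             '1990-04-01', '1990-08-01']),
}).set_index(['tic', 'dates'])

# set_index keeps the original row order, so the MultiIndex is unsorted.
print(temp.index.is_monotonic_increasing)

# sort_index returns a sorted copy; assign it (or pass inplace=True) to keep it.
sorted_temp = temp.sort_index()
print(sorted_temp.index.is_monotonic_increasing)

# After sorting, rows of the same tic are adjacent, so the display groups them.
tic_order = sorted_temp.index.get_level_values('tic').tolist()
print(tic_order)
```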

2) How do I get the ratio of NaNs?

grpd = temp.groupby(level='tic').agg(['size', 'count'])

# NaN ratio = 1 - (non-null count / group size), per tic and column
null_ratio = 1 - grpd.xs('count', axis=1, level=1) / grpd.xs('size', axis=1, level=1)

null_ratio
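
An equivalent way to get the same ratios, shown here only as a sketch, is to take the mean of the isna() mask within each tic, since the mean of a boolean column is the fraction of True values. The summary layout and its row names (pct_nan_price, pct_nan_shares, total_nan) are assumptions chosen to resemble the table in the question, not something the answer defines.

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({
    'tic': ['IBM', 'AAPL', 'AAPL', 'IBM', 'AAPL'],
    'industry': ['A', 'B', 'B', 'A', 'B'],
    'price': [np.nan, 5, 6, 11, np.nan],
    'shares': [100, 60, np.nan, 100, 62],
    'dates': pd.to_datetime(['1990-01-01', '1990-01-01', '1990-04-01',
                             '1990-04-01', '1990-08-01']),
}).set_index(['tic', 'dates'])

# True where a value is missing, restricted to the two numeric columns.
missing = temp[['price', 'shares']].isna()
by_tic = missing.groupby(level='tic')

# Mean of a boolean column is the share of True values, i.e. the NaN ratio.
null_ratio = by_tic.mean()

# Illustrative summary: percentages per column plus the total NaN count,
# transposed so that each tic becomes a column as in the question.
summary = (null_ratio.mul(100).round(1)
           .add_prefix('pct_nan_')
           .assign(total_nan=by_tic.sum().sum(axis=1))
           .T)
print(summary)
```

Formatting the percentages as strings such as "50% (1 out of 2)" is left out on purpose, because numeric values are easier to rank and filter afterwards.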

3) Rank by the NaN ratio in the price column?

null_ratio.price.rank()

tic
AAPL    1.0
IBM     2.0
Name: price, dtype: float64
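
By default rank gives rank 1 to the smallest ratio. If the tic with the most missing prices should come first, or if only the ordering is needed, something like the following sketch may help. The variable names are illustrative, and method='min' is just one possible tie-breaking choice.

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({
    'tic': ['IBM', 'AAPL', 'AAPL', 'IBM', 'AAPL'],
    'industry': ['A', 'B', 'B', 'A', 'B'],
    'price': [np.nan, 5, 6, 11, np.nan],
    'shares': [100, 60, np.nan, 100, 62],
    'dates': pd.to_datetime(['1990-01-01', '1990-01-01', '1990-04-01',
                             '1990-04-01', '1990-08-01']),
}).set_index(['tic', 'dates'])

# NaN ratio of the price column per tic.
price_ratio = temp['price'].isna().groupby(level='tic').mean()

# Rank 1 = fewest missing prices; ties share the lowest rank.
rank_fewest_first = price_ratio.rank(method='min')

# Rank 1 = most missing prices.
rank_most_first = price_ratio.rank(method='min', ascending=False)

# If only the ordering matters, sorting the ratios is enough.
ordered = price_ratio.sort_values().index.tolist()
print(rank_fewest_first, rank_most_first, ordered, sep='\n')
```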

4) How do I select the top n tics with the lowest ratio of NaNs across both columns?

null_ratio.price.nsmallest(1)

tic
AAPL    0.333333
Name: price, dtype: float64
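
The line above looks only at the price column. If "both columns" is taken to mean the share of missing cells across price and shares together (an assumption about what the question intends), one possible sketch is:

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({
    'tic': ['IBM', 'AAPL', 'AAPL', 'IBM', 'AAPL'],
    'industry': ['A', 'B', 'B', 'A', 'B'],
    'price': [np.nan, 5, 6, 11, np.nan],
    'shares': [100, 60, np.nan, 100, 62],
    'dates': pd.to_datetime(['1990-01-01', '1990-01-01', '1990-04-01',
                             '1990-04-01', '1990-08-01']),
}).set_index(['tic', 'dates'])

missing = temp[['price', 'shares']].isna()

# Fraction of missing cells per row, then averaged per tic. Every row has
# the same two columns, so this equals missing cells / total cells per tic.
combined_ratio = missing.mean(axis=1).groupby(level='tic').mean()

n = 1  # illustrative value for "top n"
top_n = combined_ratio.nsmallest(n)
print(top_n)
```

On this sample the combined measure picks IBM (1 missing cell out of 4) rather than AAPL (2 out of 6), so the choice of definition matters.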

5) Between two dates

temp.sort_index().loc[pd.IndexSlice[:, '1990-01-01':'1990-04-01'], :]
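
Once the frame is restricted to a date window, the earlier steps can be applied to the result unchanged. As an alternative to label slicing, which needs a sorted index, a boolean mask on the dates level also works on the unsorted frame. A sketch, where start, end, and window are illustrative names:

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({
    'tic': ['IBM', 'AAPL', 'AAPL', 'IBM', 'AAPL'],
    'industry': ['A', 'B', 'B', 'A', 'B'],
    'price': [np.nan, 5, 6, 11, np.nan],
    'shares': [100, 60, np.nan, 100, 62],
    'dates': pd.to_datetime(['1990-01-01', '1990-01-01', '1990-04-01',
                             '1990-04-01', '1990-08-01']),
}).set_index(['tic', 'dates'])

start, end = pd.Timestamp('1990-01-01'), pd.Timestamp('1990-04-01')

# Boolean mask on the 'dates' level; both endpoints are inclusive.
dates = temp.index.get_level_values('dates')
window = temp[(dates >= start) & (dates <= end)]

# Same NaN-ratio computation as before, now limited to the window.
window_ratio = window[['price', 'shares']].isna().groupby(level='tic').mean()
print(window_ratio)
```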

Answer 2

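You can use sort_index to get the order you want (in older pandas versions this was the sortlevel method, which has since been removed).

temp.sort_index(level='tic', inplace=True)
temp.sort_index(level=['tic', 'dates'], inplace=True)

For the NaN ratios, group by tic and compute the counts per group:

temp_grpd = temp.groupby(level='tic')
df = pd.DataFrame({
    'total_missing': temp_grpd.apply(lambda x: x['price'].isnull().sum() + x['shares'].isnull().sum()),
    'pnt_missing_price': temp_grpd.apply(lambda x: x['price'].isnull().sum() / x.shape[0]),
    'pnt_missing_shares': temp_grpd.apply(lambda x: x['shares'].isnull().sum() / x.shape[0]),
    'total_records': temp_grpd.apply(lambda x: x.shape[0]),
})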

If needed, you can transpose the data frame to match the layout in your post, but it is probably easier to work with in this form.

df['pnt_missing_price'].rank(ascending=False)

The question is not well defined. I think you may want something like the following, but it is not clear.

df['pnt_missing'] = df['total_missing'] / df['total_records']
df.sort_values('pnt_missing', ascending=True)
df.loc[df['pnt_missing'].nsmallest(5).index]

piRSquared has already given you a good answer.
