Dropping rows in a multi-index data frame?
I have this df:
import numpy as np
import pandas as pd

temp = pd.DataFrame({'tic': ['IBM', 'AAPL', 'AAPL', 'IBM', 'AAPL'],
                     'industry': ['A', 'B', 'B', 'A', 'B'],
                     'price': [np.nan, 5, 6, 11, np.nan],
                     'shares': [100, 60, np.nan, 100, np.nan],
                     'dates': pd.to_datetime(['1990-01-01', '1990-01-01', '1990-04-01',
                                              '1990-04-01', '1990-08-01'])
                     })
temp.set_index(['tic', 'dates'], inplace=True)
temp.sort_index(inplace=True)
which produces:
                 industry  price  shares
tic  dates
AAPL 1990-01-01         B    5.0    60.0
     1990-04-01         B    6.0     NaN
     1990-08-01         B    NaN     NaN
IBM  1990-01-01         A    NaN   100.0
     1990-04-01         A   11.0   100.0
How can I create a new column in the data frame that shows the number of observations for each tic? The new column would look like this:
          New column
AAPL ...           3
     ...           3
     ...           3
IBM  ...           2
     ...           2
You can use the .groupby(level=0) and .filter() methods to keep only the tics that have at least 3 rows:
In [79]: temp.groupby(level=0).filter(lambda x: len(x) >= 3)
Out[79]:
                 industry  price  shares
tic  dates
AAPL 1990-01-01         B    5.0    60.0
     1990-04-01         B    6.0     NaN
     1990-08-01         B    NaN     NaN
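A possible alternative, offered as a sketch rather than as part of the original answer: .filter() with a Python lambda calls the function once per group, which can get slow when there are many tics. Assuming the same temp frame as in the question, a boolean mask built from transform('size') should select the same rows. The names counts and filtered are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question so the snippet runs on its own
temp = pd.DataFrame({'tic': ['IBM', 'AAPL', 'AAPL', 'IBM', 'AAPL'],
                     'industry': ['A', 'B', 'B', 'A', 'B'],
                     'price': [np.nan, 5, 6, 11, np.nan],
                     'shares': [100, 60, np.nan, 100, np.nan],
                     'dates': pd.to_datetime(['1990-01-01', '1990-01-01', '1990-04-01',
                                              '1990-04-01', '1990-08-01'])})
temp = temp.set_index(['tic', 'dates']).sort_index()

# Number of rows in each 'tic' group, repeated on every row of that group
counts = temp.groupby(level='tic')['industry'].transform('size')

# Keep only the rows whose group has at least 3 rows
filtered = temp[counts >= 3]
print(filtered)
```

Grouping by level='tic' is the same as level=0 here; the name just makes the intent easier to read.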
To answer your second question:
In [83]: temp['new'] = temp.groupby(level=0)['industry'].transform('size')

In [84]: temp
Out[84]:
                 industry  price  shares  new
tic  dates
AAPL 1990-01-01         B    5.0    60.0    3
     1990-04-01         B    6.0     NaN    3
     1990-08-01         B    NaN     NaN    3
IBM  1990-01-01         A    NaN   100.0    2
     1990-04-01         A   11.0   100.0    2
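One thing worth checking is what "number of observations" should mean when a column has NaN values. As a small sketch on the same data: 'size' counts every row in the group, while 'count' counts only the non-NaN values of the selected column. The column names n_rows and n_prices below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question so the snippet runs on its own
temp = pd.DataFrame({'tic': ['IBM', 'AAPL', 'AAPL', 'IBM', 'AAPL'],
                     'industry': ['A', 'B', 'B', 'A', 'B'],
                     'price': [np.nan, 5, 6, 11, np.nan],
                     'shares': [100, 60, np.nan, 100, np.nan],
                     'dates': pd.to_datetime(['1990-01-01', '1990-01-01', '1990-04-01',
                                              '1990-04-01', '1990-08-01'])})
temp = temp.set_index(['tic', 'dates']).sort_index()

grouped_price = temp.groupby(level='tic')['price']

# 'size' counts all rows per tic, including rows where price is NaN
temp['n_rows'] = grouped_price.transform('size')

# 'count' counts only the rows per tic where price is not NaN
temp['n_prices'] = grouped_price.transform('count')

print(temp[['price', 'n_rows', 'n_prices']])
```

With this data, n_rows is 3 for AAPL and 2 for IBM, while n_prices is 2 for AAPL and 1 for IBM, because each tic has one missing price.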