Python Pandas Multi-index: keeping same length of level=1 with all level=0 indexes
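I have a dataframe df_ver1 with a MultiIndex. I want to drop every row whose level-0 key does not have exactly 2 entries at level 1. My dataframe is below.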
In [13]: df_ver1
Out[13]:
key nm          0         1         2         3
bar one -0.424972  0.567020  0.276232 -1.087401
    two -0.673690  0.113648 -1.478427  0.524988
baz one  0.404705  0.577046 -1.715002 -1.039268
    two -0.370647 -1.157892 -1.344312  0.844885
foo one  1.075770 -0.109050  1.643563 -1.469388
qux one -1.294524  0.413738  0.276662 -0.472035
    two -0.013960 -0.362543 -0.006154 -0.923061
oof two  1.340309 -1.187678 -2.211372  0.380396
My desired output is
In [13]: df_ver1_fixed
Out[13]:
key nm          0         1         2         3
bar one -0.424972  0.567020  0.276232 -1.087401
    two -0.673690  0.113648 -1.478427  0.524988
baz one  0.404705  0.577046 -1.715002 -1.039268
    two -0.370647 -1.157892 -1.344312  0.844885
qux one -1.294524  0.413738  0.276662 -0.472035
    two -0.013960 -0.362543 -0.006154 -0.923061
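As you can see, I want to drop the rows whose key has only one level[1] entry. In other words, every key I keep must have both 'one' and 'two' at the second level. Is there a pythonic way to do this? Thanks!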
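For anyone who wants to run the answers below, here is a minimal sketch that rebuilds the frame from the printout above. The numbers are copied from the question; the helper names index and data are only for illustration, and the answers refer to the same frame as df.

```python
import pandas as pd

# (key, nm) pairs in the same order as the printed frame.
index = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("bar", "two"),
     ("baz", "one"), ("baz", "two"),
     ("foo", "one"),
     ("qux", "one"), ("qux", "two"),
     ("oof", "two")],
    names=["key", "nm"],
)

# Values copied from the printout in the question.
data = [
    [-0.424972, 0.567020, 0.276232, -1.087401],
    [-0.673690, 0.113648, -1.478427, 0.524988],
    [0.404705, 0.577046, -1.715002, -1.039268],
    [-0.370647, -1.157892, -1.344312, 0.844885],
    [1.075770, -0.109050, 1.643563, -1.469388],
    [-1.294524, 0.413738, 0.276662, -0.472035],
    [-0.013960, -0.362543, -0.006154, -0.923061],
    [1.340309, -1.187678, -2.211372, 0.380396],
]

df_ver1 = pd.DataFrame(data, index=index, columns=[0, 1, 2, 3])
df = df_ver1.copy()  # the answers below call the frame df
```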
I think you need:
import pandas as pd

# keep only rows whose second-level label is 'one' or 'two'
df = df.loc[pd.IndexSlice[:, ['one', 'two']], :]
# keep only first-level groups that have exactly 2 rows
df = df[df.groupby(level=0)[df.columns[0]].transform('size') == 2]
print(df)
                0         1         2         3
key nm
bar one -0.424972  0.567020  0.276232 -1.087401
    two -0.673690  0.113648 -1.478427  0.524988
baz one  0.404705  0.577046 -1.715002 -1.039268
    two -0.370647 -1.157892 -1.344312  0.844885
qux one -1.294524  0.413738  0.276662 -0.472035
    two -0.013960 -0.362543 -0.006154 -0.923061
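To make the length filter a bit more concrete, here is a small sketch of what transform('size') produces. It runs on a reduced, hypothetical frame: the val column and the names group_size, kept and kept_keys are made up for illustration and are not part of the original data.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("bar", "two"),
     ("foo", "one"),
     ("qux", "one"), ("qux", "two"),
     ("oof", "two")],
    names=["key", "nm"],
)
df = pd.DataFrame({"val": range(len(idx))}, index=idx)  # 'val' is a made-up column

# Size of each 'key' group, repeated on every row of that group,
# so the result lines up row by row with df.
group_size = df.groupby(level="key")["val"].transform("size")

# Boolean mask: keep rows that belong to a group with exactly 2 rows.
kept = df[group_size == 2]
kept_keys = kept.index.get_level_values("key").unique().tolist()
```

Because the sizes are aligned with the rows of df, the comparison can be used directly as a row mask, with no need to collect the matching keys first.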
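Another solution is to compare sets:
# True for every row whose 'key' group holds exactly the labels 'one' and 'two'
mask = (df.reset_index()
          .groupby('key')['nm']
          .transform(lambda x: set(x) == {'one', 'two'})
          .astype(bool)
          .values)
df = df[mask]
print(df)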
                0         1         2         3
key nm
bar one -0.424972  0.567020  0.276232 -1.087401
    two -0.673690  0.113648 -1.478427  0.524988
baz one  0.404705  0.577046 -1.715002 -1.039268
    two -0.370647 -1.157892 -1.344312  0.844885
qux one -1.294524  0.413738  0.276662 -0.472035
    two -0.013960 -0.362543 -0.006154 -0.923061
This works too. You can simply group by the multi-index level 'key' and drop every group whose length is not 2.
df.groupby(by='key').filter(lambda x: len(x) == 2) # keep groups with len 2
As @Zero suggested, we can be more specific and require that the second-level labels of each group are exactly set(['one', 'two']).
df.groupby(by='key').filter(
    lambda x: set(x.index.get_level_values('nm')) == set(['one', 'two']))
key nm          0         1         2         3
bar one -0.424972  0.567020  0.276232 -1.087401
    two -0.673690  0.113648 -1.478427  0.524988
baz one  0.404705  0.577046 -1.715002 -1.039268
    two -0.370647 -1.157892 -1.344312  0.844885
qux one -1.294524  0.413738  0.276662 -0.472035
    two -0.013960 -0.362543 -0.006154 -0.923061
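The difference between the two filters only shows up when a group has two rows with the wrong labels. Here is a minimal sketch on a hypothetical frame: the 'zap' group, its 'three' label, and the val column are invented for illustration and do not appear in the question's data.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("bar", "two"),
     ("foo", "one"),
     ("zap", "one"), ("zap", "three")],  # two rows, but not 'one' and 'two'
    names=["key", "nm"],
)
df = pd.DataFrame({"val": range(len(idx))}, index=idx)  # 'val' is a made-up column

wanted = {"one", "two"}

# Length check: any group with exactly two rows passes.
by_len = df.groupby("key").filter(lambda g: len(g) == 2)

# Set check: the group's second-level labels must be exactly 'one' and 'two'.
by_set = df.groupby("key").filter(
    lambda g: set(g.index.get_level_values("nm")) == wanted
)

len_keys = sorted(by_len.index.get_level_values("key").unique())
set_keys = sorted(by_set.index.get_level_values("key").unique())
```

Here the length check keeps both 'bar' and 'zap', while the set check keeps only 'bar'. On the data in the question the two give the same result, because every two-row group happens to hold 'one' and 'two'.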
Another approach: use multi-index selection
sz = df.groupby("key").size()
indexes = sz[sz == 2].index.tolist() # first-level indexes that we want.
df.loc[indexes] # use loc for selection
key nm          0         1         2         3
bar one -0.424972  0.567020  0.276232 -1.087401
    two -0.673690  0.113648 -1.478427  0.524988
baz one  0.404705  0.577046 -1.715002 -1.039268
    two -0.370647 -1.157892 -1.344312  0.844885
qux one -1.294524  0.413738  0.276662 -0.472035
    two -0.013960 -0.362543 -0.006154 -0.923061
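The same selection can also be written as a boolean mask on the first index level, which leaves the rows in whatever order the frame already has. This is only a sketch of a variant, not something the answer itself proposes; the frame, the val column and the names keep, mask and result are hypothetical.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("qux", "one"), ("qux", "two"),
     ("foo", "one"),
     ("bar", "one"), ("bar", "two")],
    names=["key", "nm"],
)
df = pd.DataFrame({"val": range(len(idx))}, index=idx)  # 'val' is a made-up column

sz = df.groupby("key").size()
keep = sz[sz == 2].index  # first-level labels that have exactly two rows

# One boolean per row: does this row's 'key' belong to the labels we keep?
mask = df.index.get_level_values("key").isin(keep)
result = df[mask]
```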