How to get the first index of a pandas DataFrame for which several undefined columns are not null?
I have a DataFrame with several columns. I want to get the first row index for which:
- the value in column A is not null
- at least n other columns have non-null values
Example: if my DataFrame is:
Date A B C D
0 2015-01-02 NaN 1 1 NaN
1 2015-01-02 NaN 2 2 NaN
2 2015-01-02 NaN 3 3 NaN
3 2015-01-02 1 NaN 4 NaN
5 2015-01-02 NaN 2 NaN NaN
6 2015-01-03 1 NaN 6 NaN
7 2015-01-03 1 1 6 NaN
8 2015-01-03 1 1 6 8
If n=1 I would get 3
If n=2 I would get 7
If n=3 I would get 8
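For anyone who wants to reproduce the example, the frame above could be rebuilt roughly as follows. This is only a sketch: it assumes Date is kept as plain strings (the question does not say whether it is a datetime column) and that the index really does skip 4, as printed.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question; the index deliberately skips 4.
# Assumption: 'Date' is stored as strings, not datetimes.
df = pd.DataFrame(
    {
        "Date": ["2015-01-02"] * 5 + ["2015-01-03"] * 3,
        "A": [np.nan, np.nan, np.nan, 1, np.nan, 1, 1, 1],
        "B": [1, 2, 3, np.nan, 2, np.nan, 1, 1],
        "C": [1, 2, 3, 4, np.nan, 6, 6, 6],
        "D": [np.nan] * 7 + [8],
    },
    index=[0, 1, 2, 3, 5, 6, 7, 8],
)
print(df)
```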
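You can first use loc to select the rows where A is not NaN and the columns from A onward, then sum the non-null values per row and subtract 1 to exclude column A itself. Finally, use idxmax on a boolean mask: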
a = df.loc[df['A'].notnull(), 'A':].notnull().sum(axis=1).sub(1)
print (a)
3 1
6 1
7 2
8 3
dtype: int64
N = 1
print ((a == N).idxmax())
3
N = 2
print ((a == N).idxmax())
7
N = 3
print ((a == N).idxmax())
8
print (df.loc[df['A'].notnull(), 'A':])
A B C D
3 1.0 NaN 4.0 NaN
6 1.0 NaN 6.0 NaN
7 1.0 1.0 6.0 NaN
8 1.0 1.0 6.0 8.0
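One caveat with the snippet above: a == N matches rows with exactly N other non-null columns, and idxmax returns the first label even when nothing matches. A possible variant for the "at least n" wording of the question is sketched below. The helper name first_index and the choice to return None when no row qualifies are my own assumptions, not part of the answer above, and the frame is rebuilt here only so the block runs on its own.

```python
import numpy as np
import pandas as pd

# Example frame from the question (Date kept as strings for simplicity).
df = pd.DataFrame(
    {
        "Date": ["2015-01-02"] * 5 + ["2015-01-03"] * 3,
        "A": [np.nan, np.nan, np.nan, 1, np.nan, 1, 1, 1],
        "B": [1, 2, 3, np.nan, 2, np.nan, 1, 1],
        "C": [1, 2, 3, 4, np.nan, 6, 6, 6],
        "D": [np.nan] * 7 + [8],
    },
    index=[0, 1, 2, 3, 5, 6, 7, 8],
)

def first_index(df, n, reference="A"):
    """Hypothetical helper: first index label where `reference` is non-null
    and at least `n` other value columns are non-null, else None."""
    values = df.drop(columns="Date")  # assumes a 'Date' column to ignore
    # Per-row count of non-null values in every column except the reference
    others = values.drop(columns=reference).notna().sum(axis=1)
    hits = values[reference].notna() & (others >= n)
    # idxmax on an all-False mask would return the first label, so guard it
    return hits.idxmax() if hits.any() else None

for n in (1, 2, 3, 4):
    print(n, first_index(df, n))
```

With the sample data this agrees with the results above for n = 1, 2, 3, and reports None for n = 4 instead of a misleading label.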
Here is an approach that gets the indices for several values of n in one go -
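import numpy as np

def numpy_approach(df, reference='A'):
    # Work only on the value columns (everything except 'Date')
    df0 = df.iloc[:, df.columns != 'Date']
    # True for every column other than the reference column
    valid_mask = df0.columns != reference
    mask = ~np.isnan(df0.values)
    # Non-null count over the other columns, zeroed where the reference is null
    count = mask[:, valid_mask].sum(1) * mask[:, (~valid_mask).argmax()]
    # First position where the running maximum reaches n = 1, 2, 3
    idx0 = np.searchsorted(np.maximum.accumulate(count), [1, 2, 3])
    return df.index[idx0]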
Sample run -
In [555]: df
Out[555]:
Date A B C D
0 2015-01-02 NaN 1.0 1.0 NaN
1 2015-01-02 NaN 2.0 2.0 NaN
2 2015-01-02 NaN 3.0 3.0 NaN
3 2015-01-02 1.0 NaN 4.0 NaN
5 2015-01-02 NaN 2.0 NaN NaN
6 2015-01-03 1.0 NaN 6.0 NaN
7 2015-01-03 1.0 1.0 6.0 NaN
8 2015-01-03 1.0 1.0 6.0 8.0
In [556]: numpy_approach(df, reference='A')
Out[556]: Int64Index([3, 7, 8], dtype='int64')
In [557]: numpy_approach(df, reference='B')
Out[557]: Int64Index([0, 7, 8], dtype='int64')
In [558]: numpy_approach(df, reference='C')
Out[558]: Int64Index([0, 7, 8], dtype='int64')
In [568]: numpy_approach(df, reference='D')
Out[568]: Int64Index([8, 8, 8], dtype='int64')
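The function above hard-codes the thresholds [1, 2, 3] and would raise an IndexError if some threshold is never reached, because searchsorted then returns a position one past the end. A possible generalisation is sketched below. The name first_indices, the thresholds and ignore parameters, and the None placeholder for unreachable thresholds are assumptions of mine rather than part of the answer.

```python
import numpy as np
import pandas as pd

# Example frame from the question (Date kept as strings for simplicity).
df = pd.DataFrame(
    {
        "Date": ["2015-01-02"] * 5 + ["2015-01-03"] * 3,
        "A": [np.nan, np.nan, np.nan, 1, np.nan, 1, 1, 1],
        "B": [1, 2, 3, np.nan, 2, np.nan, 1, 1],
        "C": [1, 2, 3, 4, np.nan, 6, 6, 6],
        "D": [np.nan] * 7 + [8],
    },
    index=[0, 1, 2, 3, 5, 6, 7, 8],
)

def first_indices(df, thresholds, reference="A", ignore=("Date",)):
    """Hypothetical variant: one index label per threshold, None if unreachable."""
    values = df.drop(columns=list(ignore))
    ref_ok = values[reference].notna().to_numpy()
    # Per-row count of non-null values outside the reference column
    others = values.drop(columns=reference).notna().to_numpy().sum(axis=1)
    count = others * ref_ok  # zero wherever the reference is null
    # The running maximum is non-decreasing, so searchsorted (side='left')
    # finds the first row whose count is >= each threshold
    running_max = np.maximum.accumulate(count)
    positions = np.searchsorted(running_max, thresholds)
    return [df.index[p] if p < len(df) else None for p in positions]

print(first_indices(df, [1, 2, 3, 4]))
print(first_indices(df, [1, 2, 3], reference="B"))
```

Using notna() instead of np.isnan also lets the sketch tolerate non-float columns, at the cost of a little speed compared with the pure NumPy version.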