pandas subset using sliced boolean index
Code to generate test data:
import pandas as pd
import numpy as np
testdf = {'date': range(10),
'event': ['A', 'A', np.nan, 'B', 'B', 'A', 'B', np.nan, 'A', 'B'],
'id': [1] * 7 + [2] * 3}
testdf = pd.DataFrame(testdf)
print(testdf)
which gives
   date event  id
0     0     A   1
1     1     A   1
2     2   NaN   1
3     3     B   1
4     4     B   1
5     5     A   1
6     6     B   1
7     7   NaN   2
8     8     A   2
9     9     B   2
Subset testdf:
df_sub = testdf.loc[testdf.event == 'A',:]
print(df_sub)
   date event  id
0     0     A   1
1     1     A   1
5     5     A   1
8     8     A   2
(Note: the index is not reset.)
Create conditional boolean indexes:
bool_sliced_idx1 = df_sub.date < 4
bool_sliced_idx2 = (df_sub.date > 4) & (df_sub.date < 6)
I would like to insert conditional values into the original df using these new indexes, e.g.
testdf['new_column'] = np.nan
testdf.loc[bool_sliced_idx1, 'new_column'] = 'new_conditional_value'
This obviously (now) gives the error:
pandas.core.indexing.IndexingError: Unalignable boolean Series key provided
bool_sliced_idx1 looks like
>>> print(bool_sliced_idx1)
0 True
1 True
5 False
8 False
Name: date, dtype: bool
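The mask above carries only the four labels of df_sub, while testdf has ten, so .loc cannot align the two. One possible way around this, sketched here under the assumption that rows filtered out of df_sub should count as False, is to reindex the mask onto the full index first. The name full_mask is hypothetical, and the new column is created with object dtype so that it can hold strings.

```python
import numpy as np
import pandas as pd

testdf = pd.DataFrame({
    'date': range(10),
    'event': ['A', 'A', np.nan, 'B', 'B', 'A', 'B', np.nan, 'A', 'B'],
    'id': [1] * 7 + [2] * 3,
})
df_sub = testdf.loc[testdf.event == 'A', :]
bool_sliced_idx1 = df_sub.date < 4

# Expand the 4-row mask to all 10 labels of testdf. Labels missing
# from df_sub are filled with False (an assumption, see above).
full_mask = bool_sliced_idx1.reindex(testdf.index, fill_value=False)

# Object dtype so the column can hold both NaN and strings.
testdf['new_column'] = pd.Series(np.nan, index=testdf.index, dtype=object)
testdf.loc[full_mask, 'new_column'] = 'new_conditional_value'
print(testdf)
```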
I tried testdf.ix[(bool_sliced_idx1==True).index,:], but that doesn't work because
>>> (bool_sliced_idx1==True).index
Int64Index([0, 1, 5, 8], dtype='int64')
This works:
idx = np.where(bool_sliced_idx1==True)[0]
## or
# np.ravel(np.where(bool_sliced_idx1==True))
idx_original = df_sub.index[idx]
testdf.iloc[idx_original,:]
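Note that the workaround above passes index labels to .iloc, which is positional. It works here only because testdf has the default integer index, so labels and positions coincide. A label-based variant might look like the sketch below. The name true_labels is hypothetical, and the column is created with object dtype so that it can hold strings.

```python
import numpy as np
import pandas as pd

testdf = pd.DataFrame({
    'date': range(10),
    'event': ['A', 'A', np.nan, 'B', 'B', 'A', 'B', np.nan, 'A', 'B'],
    'id': [1] * 7 + [2] * 3,
})
df_sub = testdf.loc[testdf.event == 'A', :]
bool_sliced_idx1 = df_sub.date < 4

# Keep only the True entries of the mask and take their labels.
true_labels = bool_sliced_idx1[bool_sliced_idx1].index

# Select and assign by label, so this does not rely on labels
# happening to equal row positions.
testdf['new_column'] = pd.Series(np.nan, index=testdf.index, dtype=object)
testdf.loc[true_labels, 'new_column'] = 'new_conditional_value'
print(testdf.loc[true_labels, :])
```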
IIUC, you can combine all of your conditions at once instead of trying to chain them. For example, df_sub.date < 4 is really just (testdf.event == 'A') & (testdf.date < 4). So you could do something like:
# Create the conditions.
cond1 = (testdf.event == 'A') & (testdf.date < 4)
# inclusive='neither' needs pandas >= 1.3; older versions used inclusive=False.
cond2 = (testdf.event == 'A') & (testdf.date.between(4, 6, inclusive='neither'))
# Make the assignments.
testdf.loc[cond1, 'new_col'] = 'foo'
testdf.loc[cond2, 'new_col'] = 'bar'
Which gives you:
   date event  id new_col
0     0     A   1     foo
1     1     A   1     foo
2     2   NaN   1     NaN
3     3     B   1     NaN
4     4     B   1     NaN
5     5     A   1     bar
6     6     B   1     NaN
7     7   NaN   2     NaN
8     8     A   2     NaN
9     9     B   2     NaN
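As a quick sanity check of the claim that the combined conditions select the same rows as the chained ones, the row labels picked by each approach can be compared. This is only a sketch: the names is_a, chained1 and chained2 are hypothetical, and cond2 is written with plain comparisons rather than between() to avoid depending on the pandas version.

```python
import numpy as np
import pandas as pd

testdf = pd.DataFrame({
    'date': range(10),
    'event': ['A', 'A', np.nan, 'B', 'B', 'A', 'B', np.nan, 'A', 'B'],
    'id': [1] * 7 + [2] * 3,
})
df_sub = testdf.loc[testdf.event == 'A', :]

# Combined conditions, evaluated on the full frame.
is_a = testdf.event == 'A'
cond1 = is_a & (testdf.date < 4)
cond2 = is_a & (testdf.date > 4) & (testdf.date < 6)

# Chained conditions, evaluated on the subset.
chained1 = df_sub[df_sub.date < 4].index
chained2 = df_sub[(df_sub.date > 4) & (df_sub.date < 6)].index

# Both approaches should pick the same row labels.
print(list(testdf[cond1].index), list(chained1))
print(list(testdf[cond2].index), list(chained2))
```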