Drop values satisfying a condition plus an arbitrary number of following values in a pandas DataFrame
My end goal is to drop values in one column of a pandas DataFrame based on a condition on another column of the same DataFrame, plus the next few values. For example:
import pandas as pd
df = pd.DataFrame({'a': [0, 0.5, 0.2, 0, 0, 0, 0, 0.2, 0, 0, 0, 0.1, 0],
                   'b': [0.1, -0.5, -0.3, None, 100., 0.2, 0.1, None, -0.3, -0.3, None, None, None]},
                  index=pd.date_range('2015/1/1', freq='D', periods=13))
# blank out b wherever a > 0
df.loc[df['a'] > 0, 'b'] = None
print(df)
Result:
a b
2015-01-01 0.0 0.1
2015-01-02 0.5 NaN
2015-01-03 0.2 NaN
2015-01-04 0.0 NaN
2015-01-05 0.0 100.0
2015-01-06 0.0 0.2
2015-01-07 0.0 0.1
2015-01-08 0.2 NaN
2015-01-09 0.0 -0.3
2015-01-10 0.0 -0.3
2015-01-11 0.0 NaN
2015-01-12 0.1 NaN
2015-01-13 0.0 NaN
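So this drops the records that satisfy the condition, but how do I also drop the next 3 records after the condition is met? My desired output looks like this: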
a b
2015-01-01 0.0 0.1
2015-01-02 0.5 NaN
2015-01-03 0.2 NaN
2015-01-04 0.0 NaN
2015-01-05 0.0 NaN
2015-01-06 0.0 NaN
2015-01-07 0.0 0.1
2015-01-08 0.2 NaN
2015-01-09 0.0 NaN
2015-01-10 0.0 NaN
2015-01-11 0.0 NaN
2015-01-12 0.1 NaN
2015-01-13 0.0 NaN
Note that there may be consecutive rows with a > 0.
[Edit]: I seem to have found a solution:
for pos, row in df.iterrows():
    if pd.isnull(row['a']):
        pass
    elif row['a'] > 0:
        # label slices include both ends: this row plus the next 3 days
        df.loc[pos:pos + pd.Timedelta(days=3), 'b'] = None
    else:
        pass
This is rather slow, so any suggestions are welcome.
We can use boolean indexing together with loc to slice the df and set the following values:
In [392]:
# take the first index label where the condition holds
idx = df.index[df['a'] > 0][0]
idx
Out[392]:
Timestamp('2015-01-02 00:00:00')
In [393]:
# label slices include both endpoints: the matching row plus the next 3 days
df.loc[idx:idx + pd.Timedelta(days=3), 'b'] = None
df
Out[393]:
a b
2015-01-01 0.0 0.1
2015-01-02 0.5 NaN
2015-01-03 0.0 NaN
2015-01-04 0.0 NaN
2015-01-05 0.0 NaN
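The choice of slice bounds matters because label-based slices with loc include both endpoints, while positional slices with iloc exclude the stop. A minimal sketch of the difference, using a made-up Series `s` that is not part of the question's data:

```python
import pandas as pd

# Hypothetical six-day series, used only to illustrate slicing semantics.
s = pd.Series(range(6), index=pd.date_range('2015/1/1', freq='D', periods=6))

start = s.index[1]                  # 2015-01-02
end = start + pd.Timedelta(days=3)  # 2015-01-05

by_label = s.loc[start:end]   # both endpoints included: 4 rows
by_position = s.iloc[1:4]     # stop excluded: 3 rows

print(len(by_label), len(by_position))
```

So a label slice from a matching row to three days later covers that row and the next 3 rows, provided the index has exactly one row per day.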
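Edit
Here is an alternative approach that extends the answer above to work on your edited data. It uses the same principle, but loops over every index label where the condition holds and offsets each one by three days:
In [39]:
idx = df[df.a > 0].index
for index in idx:
    df.loc[index:index + pd.Timedelta(days=3), 'b'] = None
df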
Out[39]:
a b
2015-01-01 0.0 0.1
2015-01-02 0.5 NaN
2015-01-03 0.2 NaN
2015-01-04 0.0 NaN
2015-01-05 0.0 NaN
2015-01-06 0.0 NaN
2015-01-07 0.0 0.1
2015-01-08 0.2 NaN
2015-01-09 0.0 NaN
2015-01-10 0.0 NaN
2015-01-11 0.0 NaN
2015-01-12 0.1 NaN
2015-01-13 0.0 NaN
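However, timings show that your method is twice as fast. It is unclear whether my method would scale better, since that depends on the size and distribution of the data.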
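As a possible alternative that avoids the Python-level loop altogether, the condition can be spread forward with a rolling window. This is only a sketch and was not part of the timings above: it assumes "next 3 records" means the next 3 rows by position (equivalent to 3 days here, since the index has one row per day), and `n_following`, `hit` and `to_blank` are names introduced for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'a': [0, 0.5, 0.2, 0, 0, 0, 0, 0.2, 0, 0, 0, 0.1, 0],
     'b': [0.1, -0.5, -0.3, None, 100., 0.2, 0.1, None, -0.3, -0.3, None, None, None]},
    index=pd.date_range('2015/1/1', freq='D', periods=13))

n_following = 3  # hypothetical parameter: how many rows after a hit to blank

# True where the condition holds (NaN in 'a' compares as False).
hit = df['a'] > 0

# A row is blanked if it, or any of the n_following rows before it, is a hit.
# The rolling max over a window of n_following + 1 rows spreads each hit forward.
to_blank = (hit.astype(int)
               .rolling(n_following + 1, min_periods=1)
               .max()
               .astype(bool))

df.loc[to_blank, 'b'] = np.nan
print(df)
```

On the example data this leaves values in b only on 2015-01-01 and 2015-01-07, matching the desired output in the question. Whether it is actually faster would need to be measured on real data.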