Winsorize DataFrame based on Groups
I made the following reproducible example:
import pandas as pd
from scipy.stats import mstats
col1 = pd.Series(['2016-04-30','2016-04-30','2016-04-30','2016-04-30','2016-04-30','2016-04-30','2016-04-30','2016-04-30','2016-04-30','2016-04-30','2016-04-30','2016-04-30','2016-05-31','2016-05-31','2016-05-31','2016-05-31','2016-05-31','2016-05-31','2016-05-31','2016-05-31','2016-05-31','2016-05-31','2016-05-31','2016-05-31'])
col2 = pd.Series(['Discr','Discr','Discr','Discr','Discr','Discr', 'Adv', 'Adv', 'Adv', 'Adv', 'Adv', 'Adv','Discr','Discr','Discr','Discr','Discr','Discr','Adv', 'Adv', 'Adv', 'Adv', 'Adv', 'Adv'])
col3 = pd.Series(['Eq', 'Eq', 'Eq' , 'Bond','Bond','Bond', 'Eq', 'Eq', 'Eq' , 'Bond','Bond','Bond', 'Eq', 'Eq', 'Eq' , 'Bond','Bond','Bond', 'Eq', 'Eq', 'Eq' , 'Bond','Bond','Bond'])
col4 = pd.Series([5,3,200, 5,7,23,5,4,21,68,45,324,32,4,78,2,45,2,56,3,5,7,22,45])
Example = pd.DataFrame(data = pd.concat([col1,col2,col3,col4], axis=1))
Example.columns = ['Date', 'InType', 'AType', 'Value']
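It looks like this:
I want to winsorize the 'Value' column at the 1% level, after first grouping by 'Date', 'InType' and 'AType'. For example, the first group I want to winsorize has Date 2016-04-30, InType = Discr and AType = Eq. In this case I would expect 200 to be set equal to 5. I want to do this separately for every group.
Here is what I have tried so far: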
def using_mstats_df(df):
    return df.apply(using_mstats, axis=0)
def using_mstats(s):
    return mstats.winsorize(s, limits=[0.0, 0.5])
grouped = Example.groupby(['Date', 'InType', 'AType'])
grouped.apply(using_mstats_df)
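It seems to do the right thing, but when I try it on my actual (large) dataset I get a very long error that ends with
ValueError: cannot reindex from a duplicate axis
Does anyone know what I might be doing wrong, or whether I should do this in a different way?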
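As a quick check on what the `limits` argument does for the first group in the question, here is a minimal sketch. It assumes `limits` is read as the fraction of observations to replace in each tail, and that with only three values per group the count replaced is the integer part of `limit * 3`. The variable names are illustrative only.

```python
import numpy as np
from scipy.stats import mstats

# Values of the first group (2016-04-30, Discr, Eq) from the example.
group = np.array([5, 3, 200])

# Upper limit 0.5: int(0.5 * 3) = 1 value is replaced at the top.
half = mstats.winsorize(group, limits=[0.0, 0.5])

# Upper limit 0.01: int(0.01 * 3) = 0 values are replaced.
one_pct = mstats.winsorize(group, limits=[0.0, 0.01])

print(half.tolist())     # [5, 3, 5]
print(one_pct.tolist())  # [5, 3, 200]
```

So on groups this small, a 1% limit leaves the values untouched; turning 200 into 5 needs an upper limit of at least one third.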
Here is a working example (I'm not 100% sure about the winsorizing part):
import pandas as pd
import scipy.stats
col1 = pd.Series(['2016-04-30','2016-04-30','2016-04-30','2016-04-30','2016-04-30','2016-04-30','2016-04-30','2016-04-30','2016-04-30','2016-04-30','2016-04-30','2016-04-30','2016-05-31','2016-05-31','2016-05-31','2016-05-31','2016-05-31','2016-05-31','2016-05-31','2016-05-31','2016-05-31','2016-05-31','2016-05-31','2016-05-31'])
col2 = pd.Series(['Discr','Discr','Discr','Discr','Discr','Discr', 'Adv', 'Adv', 'Adv', 'Adv', 'Adv', 'Adv','Discr','Discr','Discr','Discr','Discr','Discr','Adv', 'Adv', 'Adv', 'Adv', 'Adv', 'Adv'])
col3 = pd.Series(['Eq', 'Eq', 'Eq' , 'Bond','Bond','Bond', 'Eq', 'Eq', 'Eq' , 'Bond','Bond','Bond', 'Eq', 'Eq', 'Eq' , 'Bond','Bond','Bond', 'Eq', 'Eq', 'Eq' , 'Bond','Bond','Bond'])
col4 = pd.Series([5,3,200, 5,7,23,5,4,21,68,45,324,32,4,78,2,45,2,56,3,5,7,22,45])
df = pd.DataFrame(data = pd.concat([col1,col2,col3,col4], axis=1))
df.columns = ['Date', 'InType', 'AType', 'Value']
# sort your df so its row order matches the order in which groupby yields the groups
df = df.sort_values(['Date', 'InType', 'AType'])
# empty list to store the values column after winsorization
winsorized_values = []
# winsorize every group
# limits are the fractions to replace in each tail, so 1% per side is [0.01, 0.01]
for name, group in df.groupby(['Date', 'InType', 'AType']):
    winsorized_values.append(list(scipy.stats.mstats.winsorize(group.Value.values, limits=[0.01, 0.01])))
# append the winsorized values to the dataframe, after flattening the list
df['winsorized_values'] = [item for sublist in winsorized_values for item in sublist]
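An alternative that avoids the manual sort and list flattening is to let pandas align the per-group results itself with `groupby(...).transform`. The following is only a sketch under a few assumptions: the helper `winsorize_series` and its `lower`/`upper` parameters are hypothetical names, the frame is a cut-down version of the example with two groups, and the default upper limit of 0.5 is chosen only to reproduce the "200 becomes 5" result from the question.

```python
import numpy as np
import pandas as pd
from scipy.stats import mstats

# Small frame with the same columns as the example (two groups only).
df = pd.DataFrame({
    'Date': ['2016-04-30'] * 6,
    'InType': ['Discr'] * 6,
    'AType': ['Eq', 'Eq', 'Eq', 'Bond', 'Bond', 'Bond'],
    'Value': [5, 3, 200, 5, 7, 23],
})

def winsorize_series(s, lower=0.0, upper=0.5):
    """Winsorize one group's values; lower/upper are fractions per tail."""
    clipped = mstats.winsorize(s.to_numpy(), limits=[lower, upper])
    # Return a Series on the group's own index so pandas can align it.
    return pd.Series(np.asarray(clipped), index=s.index)

# transform applies the helper to each group and returns a column
# with the same length and row order as the original frame.
df['winsorized_values'] = (
    df.groupby(['Date', 'InType', 'AType'])['Value']
      .transform(winsorize_series)
)

print(df['winsorized_values'].tolist())  # [5, 3, 5, 5, 7, 7]
```

For winsorizing at 1% in both tails, change the defaults to `lower=0.01, upper=0.01`; groups with only a handful of rows will then come back unchanged.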