Z-score by time and group
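I have a data frame in the format shown below.
I am trying to compute a z-score (standardization) for each company, every month, within the group given by its Style column, for three factors (F1, F2, F3).
For example, for 8/31/2014 I want to compute the z-score (separately for F1, F2 and F3) of each company in a style, say Construction Materials, relative to its style peers for that month. Then, again for 8/31/2014, I want to compute the z-score of each company in the style "Electronic Equipment, Instruments & Components". This process should then be repeated for every month.
To recap: start with the date, compute the z-scores within each style, and repeat for each month.
I first tried defining the z-score as zscr = lambda x: (x - x.mean()) / x.std()
and then grouping by date and style, but I did not get the desired result.
Thanks in advance.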
Date Name Style ID \
0 8/31/2014 XYZ Construction Materials ABC
1 9/30/2014 XYZ Construction Materials ABC
2 10/31/2014 XYZ Construction Materials ABC
3 11/30/2014 XYZ Construction Materials ABC
4 8/31/2014 Acme Electronic Equipment, Instruments & Components KYZ
5 9/30/2014 Acme Electronic Equipment, Instruments & Components KYZ
6 10/31/2014 Acme Electronic Equipment, Instruments & Components KYZ
F1 F2 F3
0 0.032111 0.063330 0.027733
1 0.068824 0.158614 0.032489
2 0.076838 0.034735 0.020062
3 0.020903 0.154653 0.056860
4 0.032807 1.099790 0.233216
5 -0.014995 0.814866 0.498432
6 -0.002233 1.954578 0.727823
Detailed example for 8/31/2014 and the style Construction Materials, with 3 names
Date       Name  Style                   ID    F1      F2      F3      Avg F1  Avg F2  Avg F3  Std F1  Std F2  Std F3  Zscore F1     Zscore F2     Zscore F3
8/31/2014  XYZ   Construction Materials  ABC   0.0321  0.0633  0.0277  0.0292  0.5066  0.3623  0.0219  0.5091  0.3078   0.131514468  -0.870730766  -1.087062133
8/31/2014  ABC   Construction Materials  XKSD  0.0495  0.3939  0.4258  0.0292  0.5066  0.3623  0.0219  0.5091  0.3078   0.927735574  -0.221422977   0.206304231
8/31/2014  HCAG  Construction Materials  TETR  0.0061  1.0626  0.6334  0.0292  0.5066  0.3623  0.0219  0.5091  0.3078  -1.059250041   1.092153743   0.880757903
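The Avg, Std and Zscore columns in this example are consistent with the usual per-group standardization. As a sketch, with notation introduced here only for illustration (i for a company, s for a style, t for a month-end date, n_{s,t} for the number of companies in that style on that date), and assuming the sample standard deviation, which is what the Std columns appear to use:

```latex
z_{i,s,t} = \frac{x_{i,s,t} - \bar{x}_{s,t}}{\sigma_{s,t}}, \qquad
\bar{x}_{s,t} = \frac{1}{n_{s,t}} \sum_{j=1}^{n_{s,t}} x_{j,s,t}, \qquad
\sigma_{s,t} = \sqrt{\frac{1}{n_{s,t} - 1} \sum_{j=1}^{n_{s,t}} \left(x_{j,s,t} - \bar{x}_{s,t}\right)^{2}}
```

For XYZ and F1, using the rounded values in the table: (0.0321 - 0.0292) / 0.0219 ≈ 0.13, in line with the Zscore F1 column.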
I believe you are looking for groupby + transform.
names = ['F1', 'F2', 'F3']
zscore = lambda x: (x - x.mean()) / x.std()
df[names] = df.groupby([df.Date, df.Style])[names].transform(zscore)
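To check the groupby + transform approach against the detailed example in the question, here is a minimal sketch. It assumes a small frame built from the three Construction Materials rows (rounded as shown in the question) plus the single Acme row for the same date. The names `factors`, `z` and `result` are illustrative and not part of the original code.

```python
import pandas as pd

# Rows from the question's detailed example (rounded values as shown there),
# plus the one Acme row for the same date so that a second group exists.
df = pd.DataFrame({
    "Date": ["8/31/2014"] * 4,
    "Name": ["XYZ", "ABC", "HCAG", "Acme"],
    "Style": ["Construction Materials"] * 3
             + ["Electronic Equipment, Instruments & Components"],
    "F1": [0.0321, 0.0495, 0.0061, 0.032807],
    "F2": [0.0633, 0.3939, 1.0626, 1.099790],
    "F3": [0.0277, 0.4258, 0.6334, 0.233216],
})

factors = ["F1", "F2", "F3"]


def zscore(col):
    # pandas' std() uses the sample standard deviation (ddof=1) by default
    return (col - col.mean()) / col.std()


# transform returns a frame aligned with df: one z-score per original row,
# computed within each (Date, Style) group
z = df.groupby(["Date", "Style"])[factors].transform(zscore)

# keep the raw factors and add the z-scores as new columns
result = df.join(z.add_prefix("Zscore "))
print(result[["Name", "Style", "Zscore F1", "Zscore F2", "Zscore F3"]])
```

The three Construction Materials rows should come out close to the Zscore columns in the question (small differences are due to the rounded inputs). The Acme row is the only member of its group on that date, so its sample standard deviation is undefined and its z-scores are NaN.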
I changed the groupby to year and company, and then filtered based on the z-scores.
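import pandas as pd

# ddof=0 (population standard deviation) reproduces the z-scores printed below
zscore = lambda x: (x - x.mean()) / x.std(ddof=0)

F1=[0.032111,0.068824,0.076838,0.020903, 0.032807, -0.014995, -0.002233]
F2=[0.063330,0.158614,0.034735,0.154653,1.099790,0.814866,1.954578]
F3=[0.027733,0.032489,0.020062,0.056860,0.233216,0.498432,0.727823]
Date=['8/31/2014','9/30/2014','10/31/2014','11/30/2014','8/31/2014','9/30/2014','10/31/2014']
Name=['XYZ','XYZ','XYZ','XYZ','Acme','Acme','Acme']
df=pd.DataFrame({'f1':F1,'f2':F2,'f3':F3,'date':Date,'name':Name})
df['date']=pd.to_datetime(df['date'],errors='coerce')
df['year']=df['date'].dt.strftime('%Y')
# np.float has been removed from NumPy; the built-in float does the same job
df[['f1','f2','f3']]=df[['f1','f2','f3']].astype(float)
print(df)
# group by year and company, then standardize each factor within its group
splitting=df.groupby(['year','name'])
# select several columns with a list; tuple-style indexing is no longer accepted
standardized=splitting[['f1','f2','f3']].transform(zscore)
print("\n zscores for f1,f2,f3\n", standardized)
# keep the rows whose f1 z-score exceeds 1
outliers=(standardized['f1']>1)
print("\n outliers (zscore for f1 >1)")
print(df.loc[outliers])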
f1 f2 f3 date name year
0 0.032111 0.063330 0.027733 2014-08-31 XYZ 2014
1 0.068824 0.158614 0.032489 2014-09-30 XYZ 2014
2 0.076838 0.034735 0.020062 2014-10-31 XYZ 2014
3 0.020903 0.154653 0.056860 2014-11-30 XYZ 2014
4 0.032807 1.099790 0.233216 2014-08-31 Acme 2014
5 -0.014995 0.814866 0.498432 2014-09-30 Acme 2014
6 -0.002233 1.954578 0.727823 2014-10-31 Acme 2014
zscores for f1,f2,f3
f1 f2 f3
0 -0.741823 -0.721383 -0.476007
1 0.809296 1.018644 -0.130533
2 1.147886 -1.243571 -1.033224
3 -1.215359 0.946310 1.639764
4 1.366408 -0.392237 -1.253219
5 -0.998952 -0.980577 0.059088
6 -0.367457 1.372814 1.194131
outliers (zscore for f1 >1)
f1 f2 f3 date name year
2 0.076838 0.034735 0.020062 2014-10-31 XYZ 2014
4 0.032807 1.099790 0.233216 2014-08-31 Acme 2014
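One detail worth checking: the z-scores printed above correspond to the population standard deviation (ddof=0), while pandas' std() defaults to the sample standard deviation (ddof=1), which is what the Std columns in the question's detailed example appear to use. A small sketch of the difference, using the f1 values for XYZ copied from the data above (the variable names are illustrative):

```python
import pandas as pd

# f1 values for XYZ in 2014, copied from the data above
f1 = pd.Series([0.032111, 0.068824, 0.076838, 0.020903])

# pandas default: sample standard deviation (ddof=1)
z_sample = (f1 - f1.mean()) / f1.std()

# population standard deviation (ddof=0)
z_population = (f1 - f1.mean()) / f1.std(ddof=0)

print("ddof=1:", z_sample.round(6).tolist())
print("ddof=0:", z_population.round(6).tolist())
```

The ddof=0 values are the ones that line up with the f1 column of the z-score output above. With only a handful of companies per group the two conventions differ noticeably, so it helps to pick one explicitly.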