How to count duplicates per feature (or Ridit feature engineering) in pandas
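This seems to have several uses in my machine learning project: it can serve as a duplicate count or as feature extraction, and fortunately it works for both numerical and categorical features (Ridit analysis).
My data seems to contain a lot of duplicates and I want to check. Here is my data: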
No feature_1 feature_2 feature_3
1. 67 45 56
2. 67 40 56
3. 67 40 51
This is what I want:
No feature_1 feature_2 feature_3 duplication_1 duplication_2 duplication_3
1. 67 45 56 3 1 2
2. 67 40 56 3 2 2
3. 67 40 51 3 2 1
What I did was:
# Count how often each value occurs in a feature, then merge the count back.
# The merge key must be the feature column itself, since df has no 'customer_id' column.
df1 = df.groupby(['feature_1']).size().reset_index()
df1.columns = ['feature_1', 'duplication_1']
df = df.merge(df1, on='feature_1', how='left')
df2 = df.groupby(['feature_2']).size().reset_index()
df2.columns = ['feature_2', 'duplication_2']
df = df.merge(df2, on='feature_2', how='left')
df3 = df.groupby(['feature_3']).size().reset_index()
df3.columns = ['feature_3', 'duplication_3']
df = df.merge(df3, on='feature_3', how='left')
But I am looking for a faster, better alternative, especially when there are a large number of features.
Use map with value_counts, or transform, for each column:
# Assumes `No` is the index, so df.columns holds only the feature columns
for i, x in enumerate(df.columns):
    df['duplication_{}'.format(i + 1)] = df[x].map(df[x].value_counts())
    # alternative
    # df['duplication_{}'.format(i + 1)] = df.groupby(x)[x].transform('size')
print(df)
     feature_1  feature_2  feature_3  duplication_1  duplication_2  duplication_3
No
1.0         67         45         56              3              1              2
2.0         67         40         56              3              2              2
3.0         67         40         51              3              2              1
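For reference, here is a self-contained sketch of the same idea on the sample data from the question. Setting `No` as the index and the names `feature_cols` and `alt` are assumptions made for illustration, not part of the original post. The feature columns are captured in a list up front so the loop is not affected by the columns it adds.

```python
import pandas as pd

# Sample data from the question; using "No" as the index is an assumption.
df = pd.DataFrame(
    {
        "No": [1, 2, 3],
        "feature_1": [67, 67, 67],
        "feature_2": [45, 40, 40],
        "feature_3": [56, 56, 51],
    }
).set_index("No")

# Freeze the list of feature columns before any new columns are added.
feature_cols = list(df.columns)

for i, col in enumerate(feature_cols, start=1):
    # value_counts() gives the frequency of each distinct value;
    # map() looks that frequency up for every row.
    df[f"duplication_{i}"] = df[col].map(df[col].value_counts())

# Same counts via groupby/transform, built separately for comparison.
alt = pd.DataFrame(
    {
        f"duplication_{i}": df.groupby(col)[col].transform("size")
        for i, col in enumerate(feature_cols, start=1)
    }
)

print(df)
```

Note that `value_counts()` skips missing values by default, so rows where a feature is NaN would get NaN rather than a count with the `map` variant.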