New variable calculated from the number of unique values in a column
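I need to count how many times each customer_unique_id occurs in my DataFrame and store that count in a new column/variable, drop rows so that only one row per customer_unique_id remains, and finally create a new category variable.
DataFrame: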
customer_unique_id  order_id  category
ef54                '0024'    gift
ab58                '0587'    school
yg41                '0678'    gift
af48                '0469'    gift
ef54                '8514'    school
af48                '2771'    gift
Expected DataFrame output:
customer_unique_id  order_id  category  number_of_orders  category_2
ef54                '0024'    gift      2                 school
ab58                '0587'    school    1                 NaN
yg41                '0678'    gift      1                 NaN
af48                '0469'    gift      2                 gift
The most important part for me is understanding how to create the variable number_of_orders, but category_2 would be a nice bonus.
I have at most two orders per customer_unique_id.
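You could groupby and count to get the number of orders; besides that, groupby and take the first row per customer, then merge in the second category: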
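import io
import pandas as pd

csv = io.StringIO('''customer_unique_id order_id category
ef54 '0024' gift
ab58 '0587' school
yg41 '0678' gift
af48 '0469' gift
ef54 '8514' school
af48 '2771' gift''')
df = pd.read_csv(csv, sep=r'\s+')
# First order of each customer: one row per customer_unique_id
agg_df = df.groupby('customer_unique_id', as_index=False).first()
# Category of each customer's second order, where one exists
seconds = df.groupby('customer_unique_id', as_index=False).nth(1)[['customer_unique_id', 'category']]
# Left merge keeps single-order customers; their second category becomes NaN
agg_df = agg_df.merge(seconds, on='customer_unique_id', how='left')
# Orders per customer, matched by customer id rather than by row position
agg_df['number_of_orders'] = agg_df['customer_unique_id'].map(df.groupby('customer_unique_id').size())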
>>> agg_df
  customer_unique_id order_id category_x category_y  number_of_orders
0               ab58   '0587'     school        NaN                 1
1               af48   '0469'       gift       gift                 2
2               ef54   '0024'       gift     school                 2
3               yg41   '0678'       gift        NaN                 1
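If only number_of_orders is needed, a shorter route is possible. The following is a minimal sketch, not part of the original answer, assuming the same six rows built directly as a DataFrame (the order_id quotes are dropped here for brevity, and the name first_rows is only illustrative). It broadcasts the group size to every row and then keeps each customer's first row:

```python
import pandas as pd

# Same data as in the question, built inline (assumption: order_id kept as plain strings).
df = pd.DataFrame({
    "customer_unique_id": ["ef54", "ab58", "yg41", "af48", "ef54", "af48"],
    "order_id": ["0024", "0587", "0678", "0469", "8514", "2771"],
    "category": ["gift", "school", "gift", "gift", "school", "gift"],
})

# Attach the per-customer row count to every row of that customer.
df["number_of_orders"] = df.groupby("customer_unique_id")["order_id"].transform("size")

# Keep only the first row seen for each customer, in original order.
first_rows = df.drop_duplicates("customer_unique_id").reset_index(drop=True)
print(first_rows)
```

Unlike the groupby/first approach, this keeps the customers in their order of first appearance instead of sorting them by id.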
If you want to rename the columns to match the example, do:
agg_df.columns = 'customer_unique_id order_id category category_2 number_of_orders'.split()
>>> agg_df
  customer_unique_id order_id category category_2  number_of_orders
0               ab58   '0587'   school        NaN                 1
1               af48   '0469'     gift       gift                 2
2               ef54   '0024'     gift     school                 2
3               yg41   '0678'     gift        NaN                 1
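As a variation on the steps above, the whole result can also be produced in a single groupby using named aggregation. This is only a sketch under assumptions: the helper second_or_nan and the name result are hypothetical, and the data is rebuilt inline with plain-string order ids. Using sort=False keeps the customers in the order of the expected output from the question:

```python
import numpy as np
import pandas as pd

# Same data as in the question, built inline (assumption: order_id kept as plain strings).
df = pd.DataFrame({
    "customer_unique_id": ["ef54", "ab58", "yg41", "af48", "ef54", "af48"],
    "order_id": ["0024", "0587", "0678", "0469", "8514", "2771"],
    "category": ["gift", "school", "gift", "gift", "school", "gift"],
})

def second_or_nan(values):
    # Hypothetical helper: second value of the group if present, otherwise NaN.
    return values.iloc[1] if len(values) > 1 else np.nan

result = (
    df.groupby("customer_unique_id", sort=False)
      .agg(
          order_id=("order_id", "first"),          # first order of the customer
          category=("category", "first"),          # category of that first order
          number_of_orders=("order_id", "size"),   # rows per customer
          category_2=("category", second_or_nan),  # category of the second order
      )
      .reset_index()
)
print(result)
```

Because the output column names are given directly in agg, no merge suffixes (category_x, category_y) appear and no renaming step is needed.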
Note: the expected output doesn't make sense to me, since it has 3 "school" entries, or am I missing something?