New variable calculated on number of unique values in a column

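I need to count the unique values of customer_unique_id in my DataFrame and create a new column/variable holding the number of times each customer_unique_id appears, then drop rows so that only one row per customer_unique_id remains, and finally create a new category variable.

DataFrame: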

customer_unique_id       order_id        category
   ef54                     '0024'           gift
   ab58                     '0587'         school
   yg41                     '0678'           gift
   af48                     '0469'           gift
   ef54                     '8514'         school
   af48                     '2771'           gift

Expected DataFrame output:

     customer_unique_id       order_id        category    number_of_orders      category_2
       ef54                     '0024'           gift            2                 school
       ab58                     '0587'         school            1                 NaN
       yg41                     '0678'           gift            1                 NaN
       af48                     '0469'           gift            2                 gift

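The most important thing for me is understanding how to create the variable number_of_orders, but category_2 would be a nice bonus.

I have no more than two orders per customer_unique_id.

You can use groupby and count, combined with groupby first and a merge: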

import io
import pandas as pd

csv = io.StringIO('''customer_unique_id       order_id        category
   ef54                     '0024'           gift
   ab58                     '0587'         school
   yg41                     '0678'           gift
   af48                     '0469'           gift
   ef54                     '8514'         school
   af48                     '2771'           gift''')
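df = pd.read_csv(csv, sep=r'\s+')
# First order of each customer (one row per customer_unique_id)
agg_df = df.groupby(['customer_unique_id'], as_index=False).first()
# Category of each customer's second order, where one exists
seconds = df.groupby(['customer_unique_id'], as_index=False).nth(1)[['customer_unique_id', 'category']]
agg_df = agg_df.merge(seconds, on=['customer_unique_id'], how='left')
# Both sides are sorted by customer_unique_id, so the counts line up by position
agg_df['number_of_orders'] = df.groupby(['customer_unique_id'])['category'].count().values
>>> agg_df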

    customer_unique_id  order_id    category_x  category_y  number_of_orders
0   ab58                '0587'      school      NaN         1
1   af48                '0469'      gift        gift        2
2   ef54                '0024'      gift        school      2
3   yg41                '0678'      gift        NaN         1
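The `.values` assignment above works because both groupby results come back sorted by customer_unique_id. As a hedged alternative, the counts can be looked up by key instead of by position, which also keeps the customers in their original order. This is only a sketch; the names `order_counts` and `first_orders` are made up for illustration.

```python
import io
import pandas as pd

# Same sample data as in the question (order_id keeps its quotes as text).
csv = io.StringIO('''customer_unique_id order_id category
ef54 '0024' gift
ab58 '0587' school
yg41 '0678' gift
af48 '0469' gift
ef54 '8514' school
af48 '2771' gift''')
df = pd.read_csv(csv, sep=r'\s+')

# Number of rows per customer, as a Series indexed by customer_unique_id.
order_counts = df.groupby('customer_unique_id').size()

# Keep only the first row of each customer, in order of first appearance.
first_orders = df.drop_duplicates(subset='customer_unique_id').copy()

# Look up each customer's count by key, so row order does not matter.
first_orders['number_of_orders'] = first_orders['customer_unique_id'].map(order_counts)
print(first_orders)
```

Because `map` matches on the customer id, this stays correct even if the rows are later reordered or filtered.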

If you want to rename the columns, for example, do:

agg_df.columns = 'customer_unique_id order_id category category_2 number_of_orders'.split()
>>> agg_df

    customer_unique_id  order_id    category    category_2  number_of_orders
0   ab58                '0587'      school      NaN         1
1   af48                '0469'      gift        gift        2
2   ef54                '0024'      gift        school      2
3   yg41                '0678'      gift        NaN         1
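For comparison, here is a more compact sketch that skips the merge. It is not the approach used above, and it assumes, as the question states, that no customer has more than two orders. `by_customer` and `result` are illustrative names.

```python
import io
import pandas as pd

# Same sample data as in the question.
csv = io.StringIO('''customer_unique_id order_id category
ef54 '0024' gift
ab58 '0587' school
yg41 '0678' gift
af48 '0469' gift
ef54 '8514' school
af48 '2771' gift''')
df = pd.read_csv(csv, sep=r'\s+')

by_customer = df.groupby('customer_unique_id')['category']

result = (
    df.assign(
        # transform('size') repeats the group size on every row of the group
        number_of_orders=by_customer.transform('size'),
        # shift(-1) pulls the next order's category (within the same customer)
        # onto the current row; NaN when there is no next order
        category_2=by_customer.shift(-1),
    )
    # keep the first row per customer, which now carries both new columns
    .drop_duplicates(subset='customer_unique_id')
    .reset_index(drop=True)
)
print(result)
```

This keeps the customers in the order they first appear in the data rather than sorted by id. With more than two orders per customer, category_2 would hold only the second order's category.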

Note: the expected output does not make sense to me, since school appears 3 times, or am I missing something?