Pandas - Groupby Company and drop rows according to criteria based off the Dates of values being out of order

I have a historical data log and want to calculate, per company, the number of days between progress stages (an earlier stage's timestamp must precede the later stage's).

Company   Progress      Time
AAA     3. Contract   07/10/2020
AAA     2. Discuss    03/09/2020
AAA     1. Start      02/02/2020
BBB     3. Contract   11/13/2019
BBB     3. Contract   07/01/2019
BBB     1. Start      06/22/2019
BBB     2. Discuss    04/15/2019
CCC     3. Contract   05/19/2020
CCC     2. Discuss    04/08/2020
CCC     2. Discuss    03/12/2020
CCC     1. Start      01/01/2020

Expected output:

Progress (1. Start --> 2. Discuss)

Company   Progress      Time
AAA     1. Start      02/02/2020
AAA     2. Discuss    03/09/2020
CCC     1. Start      01/01/2020
CCC     2. Discuss    03/12/2020

Progress (2. Discuss --> 3. Contract)

Company   Progress      Time
AAA     2. Discuss    03/09/2020
AAA     3. Contract   07/10/2020
CCC     2. Discuss    03/12/2020
CCC     3. Contract   05/19/2020

I did try a clumsy way to get this done, but it still needs manual filtering in Excel afterwards. Here is my code:

df_stage1_stage2 = df[(df['Progress']=='1. Start')|(df['Progress']=='2. Discuss')]
pd.pivot_table(df_stage1_stage2, index=['Company','Progress'], aggfunc={'Time': min})

Can anyone help with this? Thanks.

Create some masks to filter out the relevant rows. m1 and m2 filter out groups where 1. Start is not the "first" datetime when the group is viewed in reverse order (your data are sorted ascending by Company and descending by date). You could create more masks if you also need to check that 2. Discuss and 3. Contract are in order, rather than only checking that 1. Start is. With the data you provided, however, this returns the correct output:

import numpy as np
import pandas as pd

m1 = df.groupby('Company')['Progress'].transform('last')
# Label rows in groups whose earliest (last) row is '1. Start'; the label
# names are arbitrary -- the rows tagged 'drop' are the ones we keep
m2 = np.where(m1 == '1. Start', 'drop', 'keep')
df = df[m2 == 'drop']
df

Intermediate output:

    Company Progress    Time
0   AAA     3. Contract 07/10/2020
1   AAA     2. Discuss  03/09/2020
2   AAA     1. Start    02/02/2020
7   CCC     3. Contract 05/19/2020
8   CCC     2. Discuss  04/08/2020
9   CCC     2. Discuss  03/12/2020
10  CCC     1. Start    01/01/2020
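For reference, the masking step above can be run end-to-end on the sample data. This is a minimal sketch that skips the 'drop'/'keep' labels and filters on the transformed mask directly:

```python
import pandas as pd

# Sample data from the question: within each company, dates are newest-first
df = pd.DataFrame({
    'Company': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB', 'BBB',
                'CCC', 'CCC', 'CCC', 'CCC'],
    'Progress': ['3. Contract', '2. Discuss', '1. Start',
                 '3. Contract', '3. Contract', '1. Start', '2. Discuss',
                 '3. Contract', '2. Discuss', '2. Discuss', '1. Start'],
    'Time': ['07/10/2020', '03/09/2020', '02/02/2020',
             '11/13/2019', '07/01/2019', '06/22/2019', '04/15/2019',
             '05/19/2020', '04/08/2020', '03/12/2020', '01/01/2020'],
})

# A company's last row holds its earliest date, so the group is well-ordered
# at the start only if that last row is '1. Start'; BBB fails and is removed
m1 = df.groupby('Company')['Progress'].transform('last')
df = df[m1 == '1. Start']
```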

From there, filter as you described by sorting and dropping duplicates on the subset of the first two columns, keeping the 'first' duplicate:

Final df1 and df2 outputs:

df1

df1 = df[df['Progress'] != '3. Contract'] \
.sort_values(['Company', 'Time'], ascending=[True,True]) \
.drop_duplicates(subset=['Company', 'Progress'], keep='first')

df1 output:

    Company Progress    Time
2   AAA     1. Start    02/02/2020
1   AAA     2. Discuss  03/09/2020
10  CCC     1. Start    01/01/2020
9   CCC     2. Discuss  03/12/2020

df2

df2 = df[df['Progress'] != '1. Start'] \
.sort_values(['Company', 'Time'], ascending=[True,True]) \
.drop_duplicates(subset=['Company', 'Progress'], keep='first')

df2 output:

    Company Progress    Time
1   AAA     2. Discuss  03/09/2020
0   AAA     3. Contract 07/10/2020
9   CCC     2. Discuss  03/12/2020
7   CCC     3. Contract 05/19/2020

Assuming an already-sorted df:

(full example)

import numpy as np
import pandas as pd

data = {
    'Company':['AAA', 'AAA', 'AAA', 'BBB','BBB','BBB','BBB','CCC','CCC','CCC','CCC',],
    'Progress':['3. Contract', '2. Discuss', '1. Start', '3. Contract', '3. Contract', '2. Discuss', '1. Start', '3. Contract', '2. Discuss', '2. Discuss', '1. Start', ],
    'Time':['07-10-2020','03-09-2020','02-02-2020','11-13-2019','07-01-2019','06-22-2019','04-15-2019','05-19-2020','04-08-2020','03-12-2020','01-01-2020',],
}

df = pd.DataFrame(data)

df['Time'] = pd.to_datetime(df['Time'])

# We want to measure from the first occurrence (last date) if duplicated:
df.drop_duplicates(subset=['Company', 'Progress'], keep='first', inplace=True)

# Except for the '1. Start' rows, calculate the difference in days
df['days_delta'] = np.where((df['Progress'] != '1. Start'), df.Time.diff(-1), 0)

Output:

Company Progress    Time    days_delta
0   AAA 3. Contract 2020-07-10  123 days
1   AAA 2. Discuss  2020-03-09  36 days
2   AAA 1. Start    2020-02-02  0 days
3   BBB 3. Contract 2019-11-13  144 days
5   BBB 2. Discuss  2019-06-22  68 days
6   BBB 1. Start    2019-04-15  0 days
7   CCC 3. Contract 2020-05-19  41 days
8   CCC 2. Discuss  2020-04-08  98 days
10  CCC 1. Start    2020-01-01  0 days

If you don't want the word 'days' in the output:

df['days_delta'] = df['days_delta'].dt.days
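The same deltas can also be computed with a grouped diff, which keeps the subtraction inside each company instead of relying on the 0 placeholder for the '1. Start' rows. A sketch under the same data assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    'Company': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB', 'BBB',
                'CCC', 'CCC', 'CCC', 'CCC'],
    'Progress': ['3. Contract', '2. Discuss', '1. Start',
                 '3. Contract', '3. Contract', '2. Discuss', '1. Start',
                 '3. Contract', '2. Discuss', '2. Discuss', '1. Start'],
    'Time': ['07-10-2020', '03-09-2020', '02-02-2020',
             '11-13-2019', '07-01-2019', '06-22-2019', '04-15-2019',
             '05-19-2020', '04-08-2020', '03-12-2020', '01-01-2020'],
})
df['Time'] = pd.to_datetime(df['Time'])

# Measure from the first occurrence (latest date) when a stage is duplicated
df = df.drop_duplicates(subset=['Company', 'Progress'], keep='first')

# diff(-1) inside each group: each row minus the next (earlier) row; the last
# row of each company has no successor, so its delta is filled with 0
df['days_delta'] = df.groupby('Company')['Time'].diff(-1).dt.days.fillna(0).astype(int)
```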

First question

# Coerce Time to datetime
df['Time'] = pd.to_datetime(df['Time'])

# groupby().nth([-2, -1]) slices the last two (consecutive) rows of each group
df2 = (df.merge(df.groupby(['Company'])['Time'].nth([-2, -1]))
         .sort_values(by=['Company', 'Time'], ascending=[True, True]))

# The universal rule for this problem: after the nth slice, drop any group
# whose rows all share the same Progress value
df2 = df2[~df2.Company.isin(df2[df2.groupby('Company').Progress.transform('nunique') == 1].Company.values)]

# Calculate the diff() in Time within each group, in days
df2['diff'] = df2.sort_values(by='Progress').groupby('Company')['Time'].diff().dt.days.fillna(0)

# Filter out the groups where the Start and Discuss Times are in conflict
df2 = df2[~df2.Company.isin(df2.loc[df2['diff'] < 0, 'Company'].unique())]
df2

Company   Progress       Time  diff
1     AAA    1.Start 2020-02-02   0.0
0     AAA  2.Discuss 2020-03-09  36.0
5     CCC    1.Start 2020-01-01   0.0
4     CCC  2.Discuss 2020-03-12  71.0
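A note on the `nth([-2, -1])` slice used above: it selects the last two rows of each company group. `GroupBy.tail(2)` returns the same rows and is not what the answer uses, just an equivalent that may be easier to read. A sketch on the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Company': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB', 'BBB',
                'CCC', 'CCC', 'CCC', 'CCC'],
    'Progress': ['3. Contract', '2. Discuss', '1. Start',
                 '3. Contract', '3. Contract', '1. Start', '2. Discuss',
                 '3. Contract', '2. Discuss', '2. Discuss', '1. Start'],
    'Time': ['07/10/2020', '03/09/2020', '02/02/2020',
             '11/13/2019', '07/01/2019', '06/22/2019', '04/15/2019',
             '05/19/2020', '04/08/2020', '03/12/2020', '01/01/2020'],
})
df['Time'] = pd.to_datetime(df['Time'])

# Rows are newest-first within each company, so the last two rows per group
# are the two earliest stages; tail(2) selects exactly those rows
last_two = df.groupby('Company').tail(2)
```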

Second question

# groupby().nth([0, 1]) slices the right consecutive rows (the two latest per group)
df2 = (df.merge(df.groupby(['Company'])['Time'].nth([0, 1]))
         .sort_values(by=['Company', 'Time'], ascending=[True, True]))

# Drop any group that has duplicate Progress values after grouping
df2 = df2[~df2.Company.isin(df2[df2.groupby('Company').Progress.transform('nunique') == 1].Company.values)]
df2


  Company    Progress       Time
1     AAA   2.Discuss 2020-03-09
0     AAA  3.Contract 2020-07-10
5     CCC   2.Discuss 2020-04-08
4     CCC  3.Contract 2020-05-19