Comparing One Column against Multiple
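This is a little complicated to explain (see the sample table below for reference).
I have a dataframe with a 'Date Received' column (datetime).
I want to compare 'Date Received' against the dates in the 'Stage' columns to see whether each file was on time or late.
The problem I have is that each document corresponds to a different stage; for example, File 26 might have a Stage 4 date while File 28 might be at Stage 1.
How can I get Python to look up the correct stage column and then compare it against the received date?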
Filename  Date Received  Stage 1 Expected  Stage 2 Expected  Stage 3 Expected  Stage 4 Expected
File 1    01/01/2021     15/12/2020        NaN               NaN               NaN
File 2    01/01/2021     NaN               05/01/2021        NaN               NaN
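For anyone who wants to reproduce the problem, the sample table can be rebuilt roughly as follows. This is only a sketch: it assumes the dates are dd/mm/yyyy strings and that missing stages are NaN, and the `stage_cols` and `populated` names are introduced here for illustration. It also checks the premise of the question, namely that each file has exactly one populated stage column.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table above; dates are kept as dd/mm/yyyy strings (assumption).
df = pd.DataFrame({
    "Filename": ["File 1", "File 2"],
    "Date Received": ["01/01/2021", "01/01/2021"],
    "Stage 1 Expected": ["15/12/2020", np.nan],
    "Stage 2 Expected": [np.nan, "05/01/2021"],
    "Stage 3 Expected": [np.nan, np.nan],
    "Stage 4 Expected": [np.nan, np.nan],
})

# Collect the stage columns by name prefix (hypothetical helper list).
stage_cols = [c for c in df.columns if c.startswith("Stage")]

# Count how many stage dates are filled in for each file.
populated = df[stage_cols].notna().sum(axis=1)
print(populated.tolist())  # [1, 1]
```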
It would be better to melt your dataframe so that the columns can be compared.
import pandas as pd

# Reshape to long format: one row per (file, stage) pair
df1 = pd.melt(df, id_vars=['Filename', 'Date_Received'], var_name='Expected', value_name='Date')
# Parse both date columns as day-first (dd/mm/yyyy) so that 05/01/2021 is read as 5 January
df1[['Date_Received', 'Date']] = df1[['Date_Received', 'Date']].apply(pd.to_datetime, format='%d/%m/%Y')
print(df1)
  Filename Date_Received          Expected       Date
0   File_1    2021-01-01  Stage_1_Expected 2020-12-15
1   File_2    2021-01-01  Stage_1_Expected        NaT
2   File_1    2021-01-01  Stage_2_Expected        NaT
3   File_2    2021-01-01  Stage_2_Expected 2021-01-05
4   File_1    2021-01-01  Stage_3_Expected        NaT
5   File_2    2021-01-01  Stage_3_Expected        NaT
6   File_1    2021-01-01  Stage_4_Expected        NaT
7   File_2    2021-01-01  Stage_4_Expected        NaT
# Stages with no expected date
df1.loc[df1['Date'].isna(), 'Status'] = 'Not Received'
# Received on or before the expected date
df1.loc[df1['Date'] >= df1['Date_Received'], 'Status'] = 'On Time'
# Everything left over was received after the expected date
df1['Status'] = df1['Status'].fillna('Late')
print(df1)
  Filename Date_Received          Expected       Date        Status
0   File_1    2021-01-01  Stage_1_Expected 2020-12-15          Late
1   File_2    2021-01-01  Stage_1_Expected        NaT  Not Received
2   File_1    2021-01-01  Stage_2_Expected        NaT  Not Received
3   File_2    2021-01-01  Stage_2_Expected 2021-01-05       On Time
4   File_1    2021-01-01  Stage_3_Expected        NaT  Not Received
5   File_2    2021-01-01  Stage_3_Expected        NaT  Not Received
6   File_1    2021-01-01  Stage_4_Expected        NaT  Not Received
7   File_2    2021-01-01  Stage_4_Expected        NaT  Not Received
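The three-step status assignment above could also be written as a single expression with `numpy.select`, which takes the first matching condition per row. This is a sketch rather than part of the original answer: the small `df1` below is a hypothetical stand-in for the melted frame (underscored names assumed, only two stages included), and the `conditions` and `choices` names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format frame shaped like the melted df1 (two stages only).
df1 = pd.DataFrame({
    "Filename": ["File_1", "File_2", "File_1", "File_2"],
    "Date_Received": pd.to_datetime(["01/01/2021"] * 4, format="%d/%m/%Y"),
    "Expected": ["Stage_1_Expected", "Stage_1_Expected",
                 "Stage_2_Expected", "Stage_2_Expected"],
    "Date": pd.to_datetime(["15/12/2020", None, None, "05/01/2021"],
                           format="%d/%m/%Y"),
})

# Conditions are checked in order; the first one that is True wins.
conditions = [
    df1["Date"].isna(),                   # no expected date for this stage
    df1["Date"] >= df1["Date_Received"],  # received on or before the expected date
]
choices = ["Not Received", "On Time"]

# Rows matching neither condition were received after the expected date.
df1["Status"] = np.select(conditions, choices, default="Late")
print(df1["Status"].tolist())
# ['Late', 'Not Received', 'Not Received', 'On Time']
```

Comparisons against NaT evaluate to False, so listing the `isna()` condition first keeps empty stages from falling through to 'Late'.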
You can use melt together with dropna():
df2 = df.melt(['Filename','Date Received']).dropna()
df2 = df2.reset_index(drop=True).rename({'variable':'Stage','value':'Date'},axis='columns')
Output:
>>> df2
  Filename Date Received             Stage        Date
0   File 1    01/01/2021  Stage 1 Expected  15/12/2020
1   File 2    01/01/2021  Stage 2 Expected  05/01/2021
The original data is still kept in df.
Now compare:
df2['Date']=pd.to_datetime(df2['Date'], format='%d/%m/%Y')
df2['Date Received']=pd.to_datetime(df2['Date Received'], format='%d/%m/%Y')
df2['Status']=(df2['Date Received']>df2['Date']).map({False:'On-Time',True:'Late'})
Comparison output:
>>> df2
  Filename Date Received             Stage       Date   Status
0   File 1    2021-01-01  Stage 1 Expected 2020-12-15     Late
1   File 2    2021-01-01  Stage 2 Expected 2021-01-05  On-Time
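If reshaping is not wanted, a similar result can probably be obtained in wide format by back-filling across the stage columns and taking the first column. The sketch below is an alternative to the answers above, not a restatement of them. It assumes each file has exactly one populated stage column and that dates are dd/mm/yyyy strings; `stage_cols`, `stages`, `received` and `out` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question; missing stage dates are None (assumption).
df = pd.DataFrame({
    "Filename": ["File 1", "File 2"],
    "Date Received": ["01/01/2021", "01/01/2021"],
    "Stage 1 Expected": ["15/12/2020", None],
    "Stage 2 Expected": [None, "05/01/2021"],
    "Stage 3 Expected": [None, None],
    "Stage 4 Expected": [None, None],
})

stage_cols = [c for c in df.columns if c.startswith("Stage")]

# Parse every date column explicitly as day-first.
received = pd.to_datetime(df["Date Received"], format="%d/%m/%Y")
stages = df[stage_cols].apply(pd.to_datetime, format="%d/%m/%Y")

out = df[["Filename"]].copy()
out["Date Received"] = received
# Label of the first populated stage column in each row.
out["Stage"] = stages.notna().idxmax(axis=1)
# Back-fill along each row so the first column holds that row's only date.
out["Date"] = stages.bfill(axis=1).iloc[:, 0]
# Late when the file arrived after the expected date, otherwise on time.
out["Status"] = np.where(out["Date Received"] > out["Date"], "Late", "On-Time")
print(out)
```

One caveat with this design: a file with no stage date at all would be labelled with the first stage column and 'On-Time', because `idxmax` returns the first label when every value is False and comparisons against NaT are False. Such rows would need to be filtered out first.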