How do I reverse a cumulative count from a specific point based on a condition and then resume the count in a pandas data frame?
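I am trying to compute the cumulative number of days between dates, grouped by a column named id, but I want to reset the counter whenever a condition is met.
I also want to create a new column at each reset and put the values for those specific rows into it. In addition, I want to count down to each reset point, expressed as negative days.
So far I have tried this: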
import pandas as pd
import numpy as np
df = pd.DataFrame({'reset':['N','N','Y','N','N','Y','Y','Y','Y','Y', 'Y'],
                   'category':['low','low','low','low','low','medium','high','high','medium','medium', 'medium'],
                   'date':['2019-09-04','2019-09-05','2019-09-06','2019-09-07','2019-09-08','2021-05-23','2021-05-23','2021-05-23','2021-05-23','2021-05-23', '2021-05-22'],
                   'id':[16860,16860,16860,16860,16860,17611,23409,21765,19480,9166, 9166]
                   })
# the date strings use dashes, so the format must use dashes too
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
df = df.sort_values(['id','date'])
#create extra grouping column based on reset day
df['group'] = df['reset'].replace({'N':False,'Y':True})
df['group'] = df.groupby('id')['group'].cumsum()
df['tdelta'] = df.groupby(['id','group'])['date'].diff() / np.timedelta64(1, 'D')
df['tdelta'] = df.groupby(['id','group'])['tdelta'].cumsum().fillna(0)
df = df.sort_values(by='date', ascending=False)
df['tdelta reverse'] = df.groupby(['id','group'])['date'].diff() / np.timedelta64(1, 'D')
df['tdelta reverse'] = df.groupby(['id','group'])['tdelta reverse'].cumsum().fillna(0)
df = df.sort_values(['id','date'])
print(df)
which produces this:
   reset category        date     id  group  tdelta  tdelta reverse
10     Y   medium  2021-05-22   9166    1.0     0.0             0.0
9      Y   medium  2021-05-23   9166    2.0     0.0             0.0
0      N      low  2019-09-04  16860    0.0     0.0            -1.0
1      N      low  2019-09-05  16860    0.0     1.0             0.0
2      Y      low  2019-09-06  16860    1.0     0.0            -2.0
3      N      low  2019-09-07  16860    1.0     1.0            -1.0
4      N      low  2019-09-08  16860    1.0     2.0             0.0
5      Y   medium  2021-05-23  17611    1.0     0.0             0.0
8      Y   medium  2021-05-23  19480    1.0     0.0             0.0
7      Y     high  2021-05-23  21765    1.0     0.0             0.0
6      Y     high  2021-05-23  23409    1.0     0.0             0.0
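As a side note, the same two counters can probably be obtained without the diff/cumsum pair or the second sort. The following is only a sketch, assuming rows are sorted by date within each id; it uses just the rows for id 16860 from the frame above, and the variable name `segments` is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'reset': ['N', 'N', 'Y', 'N', 'N'],
    'date': pd.to_datetime(['2019-09-04', '2019-09-05', '2019-09-06',
                            '2019-09-07', '2019-09-08']),
    'id': [16860] * 5,
}).sort_values(['id', 'date'])

# Each 'Y' row starts a new segment within its id.
df['group'] = df['reset'].eq('Y').astype(int).groupby(df['id']).cumsum()

segments = df.groupby(['id', 'group'])['date']
# Days elapsed since the first row of the segment (forward count).
df['tdelta'] = (df['date'] - segments.transform('first')).dt.days
# Days left until the last row of the segment (zero or negative).
df['tdelta reverse'] = (df['date'] - segments.transform('last')).dt.days
print(df)
```

For these five rows the values agree with the 'tdelta' and 'tdelta reverse' columns printed above, except that they come out as integers rather than floats.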
Now that I have added 'tdelta reverse', here is a clearer example (using different data) of what I want the final data frame to look like:
reset category        date     id  tdelta1  tdelta2  tdelta3  tdelta# ...
    N   medium  22/05/2021  16860       -4
    N   medium  23/05/2021  16860       -3
    N   medium  24/05/2021  16860       -2
    N   medium  25/05/2021  16860       -1
    Y   medium  26/05/2021  16860        0
    N   medium  27/05/2021  16860        1       -4
    N   medium  28/05/2021  16860        2       -3
    N   medium  29/05/2021  16860        3       -2
    N   medium  30/05/2021  16860        4       -1
    Y   medium  31/05/2021  16860                 0
    N   medium  01/06/2021  16860                 1       -3
    N   medium  02/06/2021  16860                 2       -2
    N   medium  03/06/2021  16860                 3       -1
    Y   medium  04/06/2021  16860                          0
    N   medium  05/06/2021  16860                          1
    N   medium  06/06/2021  16860                          2
    N   medium  07/06/2021  16860                          3
    N   medium  08/06/2021  16860                          4
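In essence, a new 'tdelta#' column should be created for each group, holding the 'tdelta reverse' values before the reset point and the 'tdelta' values after it (for each group).
As a side note, if an id does not have more than one group (reset point), these extra 'tdelta#' columns can be left empty.
At the moment I am creating the new columns and filling them with the 'tdelta' values: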
for group in df['group'].unique():
    df[f'tdelta{int(group)}'] = df[(df.group == group)]['tdelta']
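However, I also need to add the 'tdelta reverse' values so that the result looks like my final example.
I was thinking I should perhaps use iloc together with groupby and/or do some concatenation?
Any suggestions on how to approach this?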
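One possible way to get the layout in the example table is to measure every row against each reset date directly, keeping only the rows that lie strictly between the previous and the next reset. This is a sketch rather than a tested solution: `add_reset_columns` is a hypothetical helper name, it handles one id at a time, and it assumes the dates within an id are distinct.

```python
import pandas as pd


def add_reset_columns(frame, prefix='tdelta'):
    """Hypothetical helper: one signed day-offset column per reset row."""
    frame = frame.sort_values('date').copy()
    reset_dates = frame.loc[frame['reset'].eq('Y'), 'date'].tolist()
    for n, current in enumerate(reset_dates, start=1):
        # Keep rows strictly between the previous and the next reset.
        in_window = pd.Series(True, index=frame.index)
        if n > 1:
            in_window &= frame['date'] > reset_dates[n - 2]
        if n < len(reset_dates):
            in_window &= frame['date'] < reset_dates[n]
        # Negative before the reset, 0 on it, positive after it.
        offset = (frame['date'] - current).dt.days
        frame[f'{prefix}{n}'] = offset.where(in_window)
    return frame


# The data from the example table: resets on 26/05, 31/05 and 04/06.
example = pd.DataFrame({
    'reset': ['N'] * 4 + ['Y'] + ['N'] * 4 + ['Y'] + ['N'] * 3 + ['Y'] + ['N'] * 4,
    'category': 'medium',
    'date': pd.date_range('2021-05-22', '2021-06-08'),
    'id': 16860,
})
out = add_reset_columns(example)
print(out)
```

With several ids, the helper could be applied per group and the pieces glued back together, for example `pd.concat([add_reset_columns(g) for _, g in df.groupby('id')])`; ids with fewer resets then simply get NaN in the extra columns.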
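So I solved this by adding the pandas combine_first function (although I consider it a makeshift approach), which combines the non-NaN values from two columns, together with a try/except statement, as in the code below: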
import pandas as pd
import numpy as np
# defined a new df for clearer output
df = pd.DataFrame({'reset':['N','Y','N','N','N','Y','N','N','Y','N','N'],
                   'category':['low','low','low','low','low','low','low','low','low','low', 'low'],
                   'date':['2019-09-04','2020-11-06','2020-11-06','2019-09-07','2019-11-08','2021-05-21','2021-06-23','2021-07-24','2021-08-25','2021-09-23', '2021-10-21'],
                   'id':[16860,16860,16860,16860,16860,16860,16860,16860,16860,16860, 16860]
                   })
# the date strings use dashes, so the format must use dashes too
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
df = df.sort_values(['id','date'])
#create extra grouping column based on reset day
df['group'] = df['reset'].replace({'N':False,'Y':True})
df['group'] = df.groupby('id')['group'].cumsum()
df['tdelta'] = df.groupby(['id','group'])['date'].diff() / np.timedelta64(1, 'D')
df['tdelta'] = df.groupby(['id','group'])['tdelta'].cumsum().fillna(0)
df = df.sort_values(by='date', ascending=False)
df['tdelta reverse'] = df.groupby(['id','group'])['date'].diff() / np.timedelta64(1, 'D')
df['tdelta reverse'] = df.groupby(['id','group'])['tdelta reverse'].cumsum().fillna(0)
# the problem solved via combine_first which combines the non nan values from both columns
df = df.sort_values(['id','date'])
for group in df['group'].unique():
    group_minus_1 = group - 1.0
    try:
        df[f'tdelta{int(group)}'] = df[(df['group'] == group)]['tdelta']
        df[f'tdelta{int(group)}'] = df[f'tdelta{int(group)}'].combine_first(df[(df['group'] == group_minus_1)]['tdelta reverse'])
    except Exception:
        continue
#print(df)
This is the output:
   reset category        date     id  group  tdelta  tdelta reverse  tdelta0  tdelta1  tdelta2  tdelta3
0      N      low  2019-09-04  16860    0.0     NaN           -65.0      0.0    -65.0      NaN      NaN
3      N      low  2019-09-07  16860    0.0     NaN           -62.0      3.0    -62.0      NaN      NaN
4      N      low  2019-11-08  16860    0.0     NaN             0.0     65.0      0.0      NaN      NaN
1      Y      low  2020-11-06  16860    1.0   250.0             0.0      NaN      0.0      0.0      NaN
2      N      low  2020-11-06  16860    1.0   250.0             0.0      NaN      0.0      0.0      NaN
5      Y      low  2021-05-21  16860    2.0   250.0           -64.0      NaN      NaN      0.0    -64.0
6      N      low  2021-06-23  16860    2.0     NaN           -31.0      NaN      NaN     33.0    -31.0
7      N      low  2021-07-24  16860    2.0     NaN             0.0      NaN      NaN     64.0      0.0
8      Y      low  2021-08-25  16860    3.0   250.0           -57.0      NaN      NaN      NaN      0.0
9      N      low  2021-09-23  16860    3.0     NaN           -28.0      NaN      NaN      NaN     29.0
10     N      low  2021-10-21  16860    3.0     NaN             0.0      NaN      NaN      NaN     57.0
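For anyone unfamiliar with the combine_first step used above, here is a tiny illustration of what it does on two index-aligned Series. The names `forward` and `backward` and their values are made up for the example:

```python
import numpy as np
import pandas as pd

# Forward count is only known from the reset row onwards ...
forward = pd.Series([np.nan, np.nan, 0.0, 1.0, 2.0])
# ... and the countdown is only known up to the reset row.
backward = pd.Series([-2.0, -1.0, 0.0, np.nan, np.nan])

# Keep `forward` where it is not NaN; otherwise fall back to `backward`.
merged = forward.combine_first(backward)
print(merged.tolist())
```

The values are matched by index label, not by position, which is why it works on the filtered slices in the loop above.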
I played with this all morning and got no further than solving a very simplified version of the df with brute-force loops:
import pandas as pd
import numpy as np
df = pd.DataFrame({'reset':['N','N','N','Y','N','N','N','N','Y','N', 'N','N','Y','N','Y', 'N'],
                   'date':[3, 7, 14, 15, 17, 26, 32, 38, 53, 63, 67, 70, 72, 85, 87, 92]})
cols_b = df.columns
# Y or N index list
reset = df['reset'].tolist()
range(len(reset))
res_list = []
for i in range(0, len(reset)) :
    if reset[i] == 'Y' :
        res_list.append(i)
#lets create a column for each reset 'Y' value:
for i in range(len(df)):
    if df['reset'].iloc[i] == 'N':
        pass
    else:
        df['tdelta{}'.format(i)] = None
cols = df.columns
#check how many new cols we have (negative, so it can be used as an index from the end):
new_cols = len(cols_b) - len(cols)
new_cols_index = list(range(new_cols,0))
# so... we have a list of row indexes with Y:
res_list
# we have a list of new column indexes:
new_cols_index
# and we have a list of indexes of the above lists:
list(range(len(res_list)))
for el in list(range(len(res_list))):
    #first column, lets fill it with number 2:
    if el == 0:
        df.iloc[:res_list[el],new_cols_index[el]] = 2
        df.iloc[res_list[el]+1:res_list[el+1],new_cols_index[el]] = 2
        #lets change all the cells with number 2 in this column (if it's a different value fill it with None):
        df.iloc[:,new_cols_index[el]] = np.where(df.iloc[:,new_cols_index[el]]==2, df['date'] - df['date'].iloc[res_list[el]],None)
    #all the middle columns lets fill them with number 4:
    if (el > 0) & (el < max(list(range(len(res_list))))):
        df.iloc[res_list[el-1]+1:res_list[el],new_cols_index[el]] = 4
        df.iloc[res_list[el]+1:res_list[el+1],new_cols_index[el]] = 4
        #lets change all the cells with number 4 in this column (if it's a different value fill it with None):
        df.iloc[:,new_cols_index[el]] = np.where(df.iloc[:,new_cols_index[el]]==4, df['date'] - df['date'].iloc[res_list[el]],None)
    #last column, lets fill it with number 6:
    if el == max(list(range(len(res_list)))):
        df.iloc[res_list[el-1]+1:res_list[el],new_cols_index[el]] = 6
        df.iloc[res_list[el]+1:,new_cols_index[el]] = 6
        #lets change all the cells with number 6 in this column (if it's a different value fill it with None):
        df.iloc[:,new_cols_index[el]] = np.where(df.iloc[:,new_cols_index[el]]==6, df['date'] - df['date'].iloc[res_list[el]],None)
# assign 0 value to 'Y' row:
for el in list(range(len(res_list))):
    # create a 0 value in each column for first 'Y'
    if df['reset'].iloc[res_list[el]] == 'Y':
        df.iloc[res_list[el],new_cols_index[el]] = 0
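The loops above can probably be replaced by NumPy broadcasting. This is a rough sketch on the same simplified frame, not a drop-in replacement: the names `prev_reset`, `next_reset`, `offsets` and `inside` are my own, and the columns are named after the row position of each 'Y', as in the loop version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'reset': ['N', 'N', 'N', 'Y', 'N', 'N', 'N', 'N',
                             'Y', 'N', 'N', 'N', 'Y', 'N', 'Y', 'N'],
                   'date': [3, 7, 14, 15, 17, 26, 32, 38,
                            53, 63, 67, 70, 72, 85, 87, 92]})

pos = np.arange(len(df))
# Row positions of the 'Y' rows.
resets = np.flatnonzero(df['reset'].eq('Y').to_numpy())
# Previous / next reset position per reset; -1 and len(df) leave the
# first and last windows open-ended.
prev_reset = np.concatenate(([-1], resets[:-1]))
next_reset = np.concatenate((resets[1:], [len(df)]))

dates = df['date'].to_numpy()
# offsets[i, j] = date of row i minus date of reset j
offsets = dates[:, None] - dates[resets][None, :]
# Row i belongs to column j if it lies strictly between the neighbouring resets.
inside = (pos[:, None] > prev_reset) & (pos[:, None] < next_reset)
values = np.where(inside, offsets, np.nan)

columns = [f'tdelta{r}' for r in resets]
result = df.join(pd.DataFrame(values, index=df.index, columns=columns))
print(result)
```

The intermediate arrays have one row per data row and one column per reset, so memory grows with rows times resets; for a frame with many resets the per-reset loop may still be the better trade-off.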