Finding the average of unequal payments per month and distributing it based on some conditions
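I'm currently struggling to transform my data into a usable dataset. I need to distribute payments evenly from the first month to the last month. The problem is that the payments are irregular and unequal. In addition, some agreements have already been paid in full; for those, the total should be spread from the first payment month over the applicable term given in the agreement data frame.
My tables are as follows:
First table: payments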
cust_id | agreement_id | date | payment |
---|---|---|---|
1 | A | 12/1/20 | 200 |
1 | A | 2/2/21 | 200 |
1 | A | 2/3/21 | 100 |
1 | A | 5/1/21 | 200 |
1 | B | 1/2/21 | 50 |
1 | B | 1/9/21 | 20 |
1 | B | 3/1/21 | 80 |
1 | B | 4/23/21 | 90 |
2 | C | 1/21/21 | 600 |
3 | D | 3/4/21 | 150 |
3 | D | 5/3/21 | 150 |
Here is the code for the payments data frame:
payments = pd.DataFrame.from_dict({'cust_id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 2, 9: 3, 10: 3},
'agreement_id': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'B', 5: 'B', 6: 'B', 7: 'B',
8: 'C', 9: 'D', 10: 'D'},
'date': {0: '12/1/20', 1: '2/2/21', 2: '2/3/21', 3: '5/1/21', 4: '1/2/21', 5: '1/9/21',
6: '3/1/21', 7: '4/23/21', 8: '1/21/21', 9: '3/4/21', 10: '5/3/21'},
'payment': {0: 200, 1: 200, 2: 100, 3: 200, 4: 50, 5: 20, 6: 80, 7: 90, 8: 600, 9: 150, 10: 150}})
Second table: agreement
agreement_id | activation | term_months | total_fee |
---|---|---|---|
A | 12/1/20 | 24 | 4800 |
B | 1/21/21 | 6 | 600 |
C | 1/21/21 | 6 | 600 |
D | 3/4/21 | 6 | 300 |
Here is the code for the agreement data frame:
agreement = pd.DataFrame.from_dict({'agreement_id': {0: 'A', 1: 'B', 2: 'C', 3: 'D'}, 'activation': {0: '12/1/20', 1: '1/2/21', 2: '1/21/21', 3: '3/4/21'}, 'term_months': {0: 24, 1: 6, 2: 6, 3: 6}, 'total_fee': {0: 4800, 1: 300, 2: 600, 3: 300}})
My desired result is as follows:
cust_id | agreement_id | date | payment |
---|---|---|---|
1 | A | 12/1/20 | 116.67 |
1 | A | 1/1/21 | 116.67 |
1 | A | 2/1/21 | 116.67 |
1 | A | 3/1/21 | 116.67 |
1 | A | 4/1/21 | 116.67 |
1 | A | 5/1/21 | 116.67 |
1 | B | 1/1/21 | 60 |
1 | B | 2/1/21 | 60 |
1 | B | 3/1/21 | 60 |
1 | B | 4/1/21 | 60 |
2 | C | 1/1/21 | 100 |
2 | C | 2/1/21 | 100 |
2 | C | 3/1/21 | 100 |
2 | C | 4/1/21 | 100 |
2 | C | 5/1/21 | 100 |
2 | C | 6/1/21 | 100 |
3 | D | 3/1/21 | 50 |
3 | D | 4/1/21 | 50 |
3 | D | 5/1/21 | 50 |
3 | D | 6/1/21 | 50 |
3 | D | 7/1/21 | 50 |
3 | D | 8/1/21 | 50 |
Or, in code form:
cust_id agreement_id date payment
0 1 A 12/1/20 116.67
1 1 A 1/1/21 116.67
2 1 A 2/1/21 116.67
3 1 A 3/1/21 116.67
4 1 A 4/1/21 116.67
5 1 A 5/1/21 116.67
6 1 B 1/1/21 60.00
7 1 B 2/1/21 60.00
8 1 B 3/1/21 60.00
9 1 B 4/1/21 60.00
10 2 C 1/1/21 100.00
11 2 C 2/1/21 100.00
12 2 C 3/1/21 100.00
13 2 C 4/1/21 100.00
14 2 C 5/1/21 100.00
15 2 C 6/1/21 100.00
16 3 D 3/1/21 50.00
17 3 D 4/1/21 50.00
18 3 D 5/1/21 50.00
19 3 D 6/1/21 50.00
20 3 D 7/1/21 50.00
21 3 D 8/1/21 50.00
The activation date is the same as the first payment date.
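To make the rule concrete, here is a minimal sketch in plain Python of how the number of months and the per-month amount could be derived for a single agreement. The helper names `months_inclusive` and `monthly_amount` are hypothetical and only illustrate the rule as I understand it from the desired output; they are not part of any library.

```python
from datetime import date

def months_inclusive(first: date, last: date) -> int:
    """Number of calendar months from first to last, counting both ends."""
    return (last.year - first.year) * 12 + (last.month - first.month) + 1

def monthly_amount(payment_rows, term_months, total_fee):
    """Return (number_of_months, amount_per_month) for one agreement.

    payment_rows is a list of (date, amount) tuples.
    """
    total_paid = sum(amount for _, amount in payment_rows)
    if total_paid == total_fee:
        # Paid in full: spread over the whole agreement term.
        n_months = term_months
    else:
        # Partially paid: spread from the first to the last payment month.
        dates = [d for d, _ in payment_rows]
        n_months = months_inclusive(min(dates), max(dates))
    return n_months, round(total_paid / n_months, 2)

# Agreement A: 700 paid of 4800, payments span Dec 2020 to May 2021.
a_rows = [(date(2020, 12, 1), 200), (date(2021, 2, 2), 200),
          (date(2021, 2, 3), 100), (date(2021, 5, 1), 200)]
print(monthly_amount(a_rows, term_months=24, total_fee=4800))  # (6, 116.67)

# Agreement C: 600 paid of 600 in a single payment, 6-month term.
c_rows = [(date(2021, 1, 21), 600)]
print(monthly_amount(c_rows, term_months=6, total_fee=600))  # (6, 100.0)
```

Agreement A is only partially paid, so its 700 is spread over the six months between its first and last payment; agreement C is paid in full, so its 600 is spread over the whole six-month term.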
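I tried using the following code (suggested by AlexK) to create another column, but it only works when the total payments are less than the total fee. When the total payments equal the total fee, I need to distribute the payments from the first payment month through the end of the term (the start month plus the term in months).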
payments['date'] = pd.to_datetime(payments['date'])
resampled_payments = (payments
    .set_index('date')
    .groupby(['cust_id', 'agreement_id'])
    .resample('MS')
    .agg({'payment': 'sum'})
    .reset_index()
)
resampled_payments['avg_monthly_payment'] = (resampled_payments
    .groupby(['cust_id', 'agreement_id'])['payment']
    .transform('mean')
)
Here is an R solution (since you also tagged the question with R):
#load libraries
library(tidyverse)
library(lubridate)
pymts <- read.table(text = "cust_id agreement_id date payment
1 A 12/1/20 200
1 A 2/2/21 200
1 A 2/3/21 100
1 A 5/1/21 200
1 B 1/2/21 50
1 B 1/9/21 20
1 B 3/1/21 80
1 B 4/23/21 90
2 C 1/21/21 600
3 D 3/4/21 150
3 D 5/3/21 150", header = T)
agmt <- read.table(text = "agreement_id activation term_months total_fee
A 12/1/20 24 4800
B 1/21/21 6 600
C 1/21/21 6 600
D 3/4/21 6 300", header = T)
#final code
final <- pymts %>%
  mutate(date = as.Date(date, "%m/%d/%y")) %>%
  left_join(agmt %>% mutate(activation = as.Date(activation, "%m/%d/%y")), by = "agreement_id") %>%
  group_by(cust_id, agreement_id) %>%
  mutate(d = n(),
         date = floor_date(date, "month")) %>%
  complete(date = seq.Date(from = min(date), by = "month",
                           length.out = ifelse(sum(payment) == first(total_fee),
                                               first(term_months),
                                               (year(max(date)) - year(min(date))) * 12 +
                                                 month(max(date)) - month(min(date)) + 1))) %>%
  mutate(payment = sum(payment, na.rm = TRUE)) %>%
  filter(!duplicated(date)) %>%
  mutate(payment = payment / n()) %>%
  select(1:4) %>%
  ungroup()
final
# A tibble: 22 x 4
cust_id agreement_id date payment
<int> <chr> <date> <dbl>
1 1 A 2020-12-01 117.
2 1 A 2021-01-01 117.
3 1 A 2021-02-01 117.
4 1 A 2021-03-01 117.
5 1 A 2021-04-01 117.
6 1 A 2021-05-01 117.
7 1 B 2021-01-01 60
8 1 B 2021-02-01 60
9 1 B 2021-03-01 60
10 1 B 2021-04-01 60
# ... with 12 more rows
Given your data frames, this should work:
import pandas as pd

# Convert the date columns to datetime
payments['date'] = pd.to_datetime(payments['date'])
agreement['activation'] = pd.to_datetime(agreement['activation'])
final = pd.merge(payments, agreement, on='agreement_id', how='left')
# Set each date to the beginning of its month
final['date'] = final['date'].dt.to_period('M').dt.to_timestamp()

def set_date_range(df):
    # Paid in full: one month start per month of the term
    if df['payment'].sum() == df['total_fee'].iloc[0]:
        return pd.date_range(df['date'].min(), periods=int(df['term_months'].iloc[0]), freq='MS')
    # Otherwise: month starts from the first to the last payment month
    return pd.date_range(df['date'].min(), df['date'].max(), freq='MS')

# Create a data frame with one row per month for each agreement
seq_df = pd.concat(
    [pd.DataFrame({'cust_id': i[0], 'agreement_id': i[1], 'date': set_date_range(g)})
     for i, g in final.groupby(['cust_id', 'agreement_id'])])
final = (pd.concat([final, seq_df], sort=True)
         .sort_values(['cust_id', 'agreement_id', 'date'])
         .reset_index(drop=True)
         .reindex(columns=final.columns))
# Total paid per agreement, repeated on every row of the group
final['payment'] = final.groupby(by=['cust_id', 'agreement_id'])["payment"].transform("sum")
final = final.drop_duplicates(['cust_id', 'agreement_id', 'date']).copy()
# Number of distinct months per agreement
final['n'] = final.groupby(by=['cust_id', 'agreement_id'])["cust_id"].transform("count")
final['payment_due'] = final['payment'] / final['n']
final[['cust_id', 'agreement_id', 'date', 'payment_due']]
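I couldn't fully replicate the tidyverse pipe style, but the output should match. The hardest part is creating seq_df, but it should be fine (double-check it against more general use cases).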
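As a possible alternative, the same result could be reached by summarising each agreement first and then expanding it into one row per month, which avoids the concatenate-and-deduplicate step. This is only a sketch under some assumptions: it uses the values from the data frame code in the question (agreement B with a total fee of 300), and the names `summary`, `first_month`, `last_month` and `paid` are introduced here for illustration.

```python
import pandas as pd

payments = pd.DataFrame({
    'cust_id': [1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 3],
    'agreement_id': list('AAAABBBBCDD'),
    'date': ['12/1/20', '2/2/21', '2/3/21', '5/1/21', '1/2/21', '1/9/21',
             '3/1/21', '4/23/21', '1/21/21', '3/4/21', '5/3/21'],
    'payment': [200, 200, 100, 200, 50, 20, 80, 90, 600, 150, 150],
})
agreement = pd.DataFrame({
    'agreement_id': list('ABCD'),
    'term_months': [24, 6, 6, 6],
    'total_fee': [4800, 300, 600, 300],
})

payments['date'] = pd.to_datetime(payments['date'], format='%m/%d/%y')
# Month start of each payment
payments['month'] = payments['date'].dt.to_period('M').dt.to_timestamp()

# One row per agreement: first/last payment month and total paid
summary = (payments
           .groupby(['cust_id', 'agreement_id'], as_index=False)
           .agg(first_month=('month', 'min'),
                last_month=('month', 'max'),
                paid=('payment', 'sum'))
           .merge(agreement, on='agreement_id', how='left'))

rows = []
for r in summary.itertuples(index=False):
    if r.paid == r.total_fee:
        # Paid in full: spread over the whole term
        months = pd.date_range(r.first_month, periods=int(r.term_months), freq='MS')
    else:
        # Partially paid: spread from the first to the last payment month
        months = pd.date_range(r.first_month, r.last_month, freq='MS')
    rows.append(pd.DataFrame({'cust_id': r.cust_id,
                              'agreement_id': r.agreement_id,
                              'date': months,
                              'payment': r.paid / len(months)}))

result = pd.concat(rows, ignore_index=True)
print(result.round(2))
```

Because the loop runs once per agreement rather than once per payment, the summary table stays small even when there are many payment rows; whether that matters depends on the size of the real data.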