How to turn daily data values into day-over-day % differences for the latest available full calendar week with pandas?
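I have a CSV file containing daily data for the last 30 days in the format below. However, an ID that was added recently is expected to have fewer rows (see ID=2, which has only 2 days of data):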
my.csv
date ID Name Value1 Value2 Value3
07-09-2020 1 ACME 111 3000 123
08-09-2020 1 ACME 222 2500 345
09-09-2020 1 ACME 333 4500 456
10-09-2020 1 ACME 444 1000 567
11-09-2020 1 ACME 555 9000 678
12-09-2020 1 ACME 666 400 789
13-09-2020 1 ACME 666 450 789
14-09-2020 1 ACME 666 444 789
12-09-2020 2 EMCA 111 999 123
13-09-2020 2 EMCA 222 888 345
#...
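I'm looking for a solution that will:
- get the data for the latest full calendar week for each ID (right now that means ignoring 14-09-2020 and any date before 07-09-2020, but the latest fully available calendar week has to be worked out on every run, because the dates in the file keep changing)
- create a new dataframe from the values in column Value2, with the % difference between consecutive days of that full calendar week
- calculate the average % difference for the whole week
- save the dataframe to a new CSV file
Desired output for each ID: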
ID Name 07-09-2020 % Difference 08-09-2020 % Difference 09-09-2020 % Difference 10-09-2020 % Difference 11-09-2020 % Difference 12-09-2020 % Difference 13-09-2020 Weekly % Difference Average
1 ACME 3000 -0.166667 2500 0.8 4500 -0.777778 1000 8.0 9000 -0.955556 400 0.125000 450 1.170833
2 EMCA N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A 999 -0.111111 888 -0.111111
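To make the expected numbers concrete, here is a small check of the arithmetic for ID 1. It assumes each "% Difference" is (current - previous) / previous, expressed as a fraction, and that the weekly figure is the plain mean of the six day-to-day changes. The variable names are illustrative only.

```python
import pandas as pd

# Value2 for ID 1, Monday 07-09-2020 through Sunday 13-09-2020 (from my.csv)
value2 = pd.Series(
    [3000, 2500, 4500, 1000, 9000, 400, 450],
    index=pd.date_range("2020-09-07", periods=7, freq="D"),
    dtype="float64",
)

# Day-over-day change as a fraction: (current - previous) / previous.
# The first day has no predecessor, so its NaN is dropped.
diffs = value2.pct_change().dropna()

# Mean of the six day-to-day changes
weekly_avg = diffs.mean()

print(diffs.round(6).tolist())
print(round(weekly_avg, 6))
```

The first change is (2500 - 3000) / 3000, which is about -0.166667, and the mean of the six changes is about 1.170833, matching the last column of the desired output.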
My code so far:
import datetime
from datetime import timedelta

import pandas as pd

data = pd.read_csv("path/to/my.csv", quotechar='"')

# generate latest full calendar week dates
today = datetime.date.today()
weekday = today.weekday()
start_delta = datetime.timedelta(days=weekday, weeks=1)
start_of_week = today - start_delta  # Monday of the previous calendar week
week_dates = []
for day in range(7):
    week_dates.append(start_of_week + datetime.timedelta(days=day))

# check if latest full calendar week dates are available in my.csv
# if any of the days of the week for latest calendar week is not present, then select dates for the week before this week
last_week_dates = []
for i in week_dates:
    last_week_dates.append(i.strftime("%d-%m-%Y"))

available_dates = set(data['date'])
if any(d not in available_dates for d in last_week_dates):
    for i in range(7, 14):
        print(today - timedelta(days=i))
    # get values from the column 'Value2' for the previous week (if last week dates are not in the file)
    # save values as columns in new dataframe
    # calculate %difference and weekly avg
else:
    # get values from the column 'Value2' for the last week
    # save values as columns in new dataframe
    # calculate %difference and weekly avg
    pass

# finalData is the dataframe the steps above still have to build
finalData.to_csv("path/to/output.csv", index=False)
Can anyone help? Thank you in advance!
Comments inline:
# continues from the question's code: `data` is the DataFrame read from my.csv

# ensure 'date' is of <type datetime>
data['date'] = pd.to_datetime(data['date'], dayfirst=True)

# select last full calendar week (Monday to Sunday), counted back from today
end = pd.Timestamp.today().normalize()
if end.weekday() != 6:
    end -= pd.Timedelta(days=end.weekday() + 1)
out = data.loc[
    data['date'].between(end - pd.Timedelta(days=6), end)
].copy()  # work on a copy, so the assignment below does not touch a slice of `data`

# cast back to string, to control the way it is printed
out['date'] = out['date'].dt.strftime('%d-%m-%Y')

# calculate and reshape
out = out.set_index(['date', 'ID', 'Name'])['Value2'].to_frame()
# day-over-day change within each ID, in row order
out['Difference'] = out.groupby('ID')['Value2'].pct_change()
out = out.unstack('date')
out.sort_index(axis=1, level='date', kind='mergesort', inplace=True)
out.dropna(axis=1, how='all', inplace=True)
out = out.swaplevel(0, 1, axis=1)
out['Weekly Difference Average'] = (
    out.loc[:, (slice(None), 'Difference')]
    .mean(axis=1)
)
Output:
date    07-09-2020 08-09-2020         09-09-2020         10-09-2020         \
            Value2 Difference  Value2 Difference  Value2 Difference  Value2
ID Name
1  ACME     3000.0  -0.166667  2500.0        0.8  4500.0  -0.777778  1000.0
2  EMCA        NaN        NaN     NaN        NaN     NaN        NaN     NaN

date    11-09-2020         12-09-2020         13-09-2020         \
        Difference  Value2 Difference  Value2 Difference  Value2
ID Name
1  ACME        8.0  9000.0  -0.955556   400.0   0.125000   450.0
2  EMCA        NaN     NaN        NaN   999.0  -0.111111   888.0

date    Weekly Difference Average
ID Name
1  ACME                  1.170833
2  EMCA                 -0.111111
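The frame printed above has two-level column labels (the date, then Value2 or Difference), and to_csv writes those as two header rows. If a single header row closer to the desired layout is wanted, one option is to join the two levels first. This is a minimal sketch on a small hand-built stand-in for the result. The " % Difference" suffix, the flatten helper and the flat variable are my own illustrative choices and do not come from the code above.

```python
import pandas as pd

# Small stand-in for the reshaped result: columns are (date, measure) pairs
columns = pd.MultiIndex.from_tuples(
    [
        ("12-09-2020", "Value2"),
        ("13-09-2020", "Difference"),
        ("13-09-2020", "Value2"),
        ("Weekly Difference Average", ""),
    ],
    names=["date", None],
)
index = pd.MultiIndex.from_tuples(
    [(1, "ACME"), (2, "EMCA")], names=["ID", "Name"]
)
out = pd.DataFrame(
    [[400.0, 0.125, 450.0, 1.170833], [999.0, -0.111111, 888.0, -0.111111]],
    index=index,
    columns=columns,
)


def flatten(label):
    """Turn a (date, measure) pair into a single column name (hypothetical helper)."""
    date, measure = label
    if measure == "Difference":
        return f"{date} % Difference"  # change relative to the previous day
    return date  # Value2 columns and the weekly average keep the top-level label


flat = out.copy()
flat.columns = [flatten(label) for label in out.columns]

# ID and Name live in the index, so move them back into ordinary columns
flat = flat.reset_index()
print(flat.columns.tolist())
```

With ID and Name restored as ordinary columns, passing index=False when saving keeps the row numbers out of the file.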
You can then save the result with df.to_csv().
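Note that the code above anchors the week on today's date, whereas the question asks for the latest week that is fully present in the file. One way that could be handled, as a sketch only: start from the Monday-to-Sunday week containing the newest date in the file, step back one week at a time, and stop at the first week whose seven days all appear. The helper name latest_full_week and the rule "all seven dates appear for at least one ID" are assumptions of mine, not part of the answer.

```python
import pandas as pd


def latest_full_week(dates):
    """Return (monday, sunday) of the newest Mon-Sun week whose seven
    days all occur in `dates`, or None if there is no such week.

    Hypothetical helper: a week counts as "full" when each of its seven
    calendar days appears at least once, for any ID.
    """
    days = pd.DatetimeIndex(pd.to_datetime(dates)).normalize().unique()
    if len(days) == 0:
        return None
    newest, oldest = days.max(), days.min()
    # Sunday of the calendar week that contains the newest date
    sunday = newest + pd.Timedelta(days=6 - newest.weekday())
    while sunday >= oldest:
        week = pd.date_range(end=sunday, periods=7, freq="D")
        if week.isin(days).all():
            return week[0], week[-1]
        sunday -= pd.Timedelta(weeks=1)  # step back one calendar week
    return None


# Dates from the sample in the question, parsed day-first
sample = pd.to_datetime(
    ["07-09-2020", "08-09-2020", "09-09-2020", "10-09-2020", "11-09-2020",
     "12-09-2020", "13-09-2020", "14-09-2020", "12-09-2020", "13-09-2020"],
    format="%d-%m-%Y",
)
start, end = latest_full_week(sample)
print(start.date(), end.date())
```

For the sample this picks 07-09-2020 to 13-09-2020: the week starting 14-09-2020 is rejected because only its Monday is present. The returned bounds could then be passed to the between(...) filter in place of the values derived from today's date.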