解决日期和数据框之间这个问题的最佳方法是什么?
What is the best way to solve this problem between dates and dataframes?
在这个问题中,你有两个数据框,一个是最后发布的价格,通常是今天。在另一个数据框中,我们有所有的发布。
我们的想法是,我们可以使用这两个数据框,结果是当前价格和最近第二天的价格之间的差异。重复今天并忽略倒数第二个日期。最难的部分是这种差异需要遵循周期性模式。所以如果日期类型是星期五,则只能与之前的星期五相差。
以重复行的方式,除了不可用的价格。
第一个数据帧:
import pandas as pd
data = {
'Type': ['Product1', 'Product2', 'Product3'],
'State': ['New York', 'Washington', 'Illinois'],
'Date':['25/03/2022','25/03/2022','25/03/2022'],
'Price':['5.00','4.00','4.00'],
'Type-Date':['Friday (only)','Friday (only)','Monday, Wednesday, Friday (only)']}
df_1 = pd.DataFrame(data)
df_1
Type State Date Price Name-Date
0 Product1 New York 25/03/2022 5.00 Friday (only)
1 Product2 Washington 25/03/2022 4.00 Friday (only)
2 Product3 Illinois 25/03/2022 4.00 Monday, Wednesday, Friday (only)
第二个数据帧:
data = {'Type': ['Product1', 'Product1', 'Product1','Product2','Product2','Product2','Product3','Product3','Product3'],
'State': ['New York', 'New York','New York', 'Washington', 'Washington', 'Washington', 'Illinois', 'Illinois', 'Illinois'],
'Date':['25/03/2022','04/03/2022','25/02/2022', '25/03/2022', '11/03/2022', '04/03/2022', '25/03/2022', '16/03/2022', '14/03/2022'],
'Price':['5.00','4.00','4.00','4.00','3.00','2.00','4.00','3.00','4.00'],
'Type-Date':['Friday (only)','Friday (only)','Friday (only)','Friday (only)','Friday (only)','Friday (only)',
'Monday, Wednesday, Friday (only)','Monday, Wednesday, Friday (only)','Monday, Wednesday, Friday (only)']}
df_2 = pd.DataFrame(data)
df_2
Type State Date Price Type-Date
0 Product1 New York 25/03/2022 5.00 Friday (only)
1 Product1 New York 04/03/2022 4.00 Friday (only)
2 Product1 New York 25/02/2022 4.00 Friday (only)
3 Product2 Washington 25/03/2022 4.00 Friday (only)
4 Product2 Washington 11/03/2022 3.00 Friday (only)
5 Product2 Washington 04/03/2022 2.00 Friday (only)
6 Product3 Illinois 25/03/2022 4.00 Monday, Wednesday, Friday (only)
7 Product3 Illinois 16/03/2022 3.00 Monday, Wednesday, Friday (only)
8 Product3 Illinois 14/03/2022 4.00 Monday, Wednesday, Friday (only)
想要的结果
Type State Date Price Type-Date
0 Product1 New York 25/03/2022 5.00 Friday (only)
1 Product1 New York 18/03/2022 NaN Friday (only)
2 Product1 New York 11/03/2022 NaN Friday (only)
3 Product2 Washington 25/03/2022 4.00 Friday (only)
4 Product2 Washington 18/03/2022 NaN Friday (only)
5 Product3 Illinois 25/03/2022 4.00 Monday, Wednesday, Friday (only)
6 Product3 Illinois 23/03/2022 NaN Monday, Wednesday, Friday (only)
7 Product3 Illinois 21/03/2022 NaN Monday, Wednesday, Friday (only)
8 Product3 Illinois 18/03/2022 NaN Monday, Wednesday, Friday (only)
这里有很多,这也意味着可能会出现几种可能的情况,这些情况可能会或可能不会在此答案中预料到。例如,如果在 df_2 中找不到给定类型的 df_1 中的日期,或者给定类型在 df_2 中没有条目,等等
考虑到这一点,这里有一些代码可以产生问题中指定的所需结果:
import pandas as pd
import numpy as np
data = {
'Type': ['Product1', 'Product2', 'Product3'],
'State': ['New York', 'Washington', 'Illinois'],
'Date':['25/03/2022','25/03/2022','25/03/2022'],
'Price':['5.00','4.00','4.00'],
'Type-Date':['Friday (only)','Friday (only)','Monday, Wednesday, Friday (only)']}
df_1 = pd.DataFrame(data)
data = {'Type': ['Product1', 'Product1', 'Product1','Product2','Product2','Product2','Product3','Product3','Product3'],
'State': ['New York', 'New York','New York', 'Washington', 'Washington', 'Washington', 'Illinois', 'Illinois', 'Illinois'],
'Date':['25/03/2022','04/03/2022','25/02/2022', '25/03/2022', '11/03/2022', '04/03/2022', '25/03/2022', '16/03/2022', '14/03/2022'],
'Price':['5.00','4.00','4.00','4.00','3.00','2.00','4.00','3.00','4.00'],
'Type-Date':['Friday (only)','Friday (only)','Friday (only)','Friday (only)','Friday (only)','Friday (only)',
'Monday, Wednesday, Friday (only)','Monday, Wednesday, Friday (only)','Monday, Wednesday, Friday (only)']}
df_2 = pd.DataFrame(data)
'''
Objective:
Create a dataframe which for each Type contains:
- today's Date and Price from df_1
- prior Date values with Price of NaN going back in time according to the Type's corresponding Type-Date value, back to but not including the penultimate date for which a price is available in df_2
'''
dayStrToInt = {'Monday':0,'Tuesday':1,'Wednesday':2,'Thursday':3,'Friday':4,'Saturday':5,'Sunday':6}
freqByType = {}
def setFreqByType(row):
weekdays = [s.strip() for s in row['Type-Date'].replace('(only)', '').split(',')]
if not weekdays:
raise ValueError(f"No weekdays found in Type-Date {repr(row['Type-Date'])}")
days = []
for w in weekdays:
if w not in dayStrToInt:
raise ValueError(f'Bad day-of-week string {w}')
days.append(dayStrToInt[w])
freqByType[row['Type']] = days
import datetime
datePriceListByType = []
def compileDatePriceByType(row):
curType = row['Type']
curDate = datetime.datetime.strptime(row['Date'], '%d/%m/%Y').date()
allDates = [datetime.datetime.strptime(dateStr, '%d/%m/%Y').date() for dateStr in df_2[df_2['Type']==row['Type']]['Date']]
allDateStrs = [dt.strftime('%d/%m/%Y') for dt in allDates]
minDate = min(allDates)
newDates = [curDate]
dt = curDate
days = freqByType[curType]
while dt > minDate:
curWD = dt.weekday()
nextWD = curWD
while nextWD not in days:
nextWD = (nextWD - 1) % 7
iWD = (days.index(nextWD) - (1 if nextWD == curWD else 0)) % len(days)
dt -= datetime.timedelta(days=(curWD - days[iWD]) % 7 if curWD != days[iWD] else 7)
if dt in allDates:
break
if dt > minDate:
newDates.append(dt)
datePrice = [[dt.strftime('%d/%m/%Y') for dt in newDates], [row['Price']] + [np.nan]*(len(newDates) - 1)]
datePriceListByType.append(datePrice)
df_1.apply(setFreqByType, axis=1)
df_1.apply(compileDatePriceByType, axis=1)
df_result = df_1
df_result[['Date', 'Price']] = pd.DataFrame(datePriceListByType, columns=['Date', 'Price'])
df_result = df_result.explode(['Date', 'Price'], ignore_index=True)
print(df_result)
输出:
Type State Date Price Type-Date
0 Product1 New York 25/03/2022 5.00 Friday (only)
1 Product1 New York 18/03/2022 NaN Friday (only)
2 Product1 New York 11/03/2022 NaN Friday (only)
3 Product2 Washington 25/03/2022 4.00 Friday (only)
4 Product2 Washington 18/03/2022 NaN Friday (only)
5 Product3 Illinois 25/03/2022 4.00 Monday, Wednesday, Friday (only)
6 Product3 Illinois 23/03/2022 NaN Monday, Wednesday, Friday (only)
7 Product3 Illinois 21/03/2022 NaN Monday, Wednesday, Friday (only)
8 Product3 Illinois 18/03/2022 NaN Monday, Wednesday, Friday (only)
更新: 键是(类型,区域)而不只是类型。
如果需要 Type-Date(即每周计划)根据 multi-column 键(例如(类型,地区))而变化,也可以实现。虽然可以根据键列列表对此进行概括,但我将仅分享一个硬编码类型和区域两列的示例:
import pandas as pd
import numpy as np
data = {
'Type': ['Product1', 'Product1', 'Product2', 'Product3'],
'Region': ['Northeast', 'Southeast', 'Northwest', 'Midwest'],
'State': ['New York', 'Florida', 'Washington', 'Illinois'],
'Date':['25/03/2022','25/03/2022','25/03/2022','25/03/2022'],
'Price':['5.00','4.50','4.00','4.00'],
'Type-Date':['Friday (only)','Tuesday, Friday (only)','Friday (only)','Monday, Wednesday, Friday (only)']}
df_1 = pd.DataFrame(data)
print(f"df_1\n{df_1}")
data = {'Type': ['Product1', 'Product1', 'Product1','Product1', 'Product1', 'Product1','Product2','Product2','Product2','Product3','Product3','Product3'],
'Region': ['Northeast', 'Northeast', 'Northeast', 'Southeast', 'Southeast', 'Southeast', 'Northwest', 'Northwest', 'Northwest', 'Midwest', 'Midwest', 'Midwest'],
'State': ['New York', 'New York','New York', 'Florida', 'Florida', 'Florida', 'Washington', 'Washington', 'Washington', 'Illinois', 'Illinois', 'Illinois'],
'Date':['25/03/2022','04/03/2022','25/02/2022', '25/03/2022','04/03/2022','25/02/2022', '25/03/2022', '11/03/2022', '04/03/2022', '25/03/2022', '16/03/2022', '14/03/2022'],
'Price':['5.00','4.00','4.00','4.50','4.25','4.10','4.00','3.00','2.00','4.00','3.00','4.00'],
'Type-Date':['Friday (only)','Friday (only)','Friday (only)','Tuesday, Friday (only)','Tuesday, Friday (only)','Tuesday, Friday (only)','Friday (only)','Friday (only)','Friday (only)',
'Monday, Wednesday, Friday (only)','Monday, Wednesday, Friday (only)','Monday, Wednesday, Friday (only)']}
df_2 = pd.DataFrame(data)
print(f"df_2\n{df_2}")
'''
Objective:
Create a dataframe which for each (Type, Region) pair contains:
- today's Date and Price from df_1
- prior Date values with Price of NaN going back in time according to the (Type, Region) pair's corresponding Type-Date value, back to but not including the penultimate date for which a price is available in df_2
'''
dayStrToInt = {'Monday':0,'Tuesday':1,'Wednesday':2,'Thursday':3,'Friday':4,'Saturday':5,'Sunday':6}
freqByTypeRegion = {}
def setFreqByTypeRegion(row):
weekdays = [s.strip() for s in row['Type-Date'].replace('(only)', '').split(',')]
if not weekdays:
raise ValueError(f"No weekdays found in Type-Date {repr(row['Type-Date'])}")
days = []
for w in weekdays:
if w not in dayStrToInt:
raise ValueError(f'Bad day-of-week string {w}')
days.append(dayStrToInt[w])
freqByTypeRegion[(row['Type'], row['Region'])] = days
import datetime
datePriceListByTypeRegion = []
def compileDatePriceByTypeRegion(row):
curTypeRegion = (row['Type'], row['Region'])
curDate = datetime.datetime.strptime(row['Date'], '%d/%m/%Y').date()
allDates = [datetime.datetime.strptime(dateStr, '%d/%m/%Y').date() for dateStr in df_2[(df_2['Type']==row['Type']) & (df_2['Region']==row['Region'])]['Date']]
allDateStrs = [dt.strftime('%d/%m/%Y') for dt in allDates]
minDate = min(allDates)
newDates = [curDate]
dt = curDate
days = freqByTypeRegion[curTypeRegion]
while dt > minDate:
curWD = dt.weekday()
nextWD = curWD
while nextWD not in days:
nextWD = (nextWD - 1) % 7
iWD = (days.index(nextWD) - (1 if nextWD == curWD else 0)) % len(days)
dt -= datetime.timedelta(days=(curWD - days[iWD]) % 7 if curWD != days[iWD] else 7)
if dt in allDates:
break
if dt > minDate:
newDates.append(dt)
datePrice = [[dt.strftime('%d/%m/%Y') for dt in newDates], [row['Price']] + [np.nan]*(len(newDates) - 1)]
datePriceListByTypeRegion.append(datePrice)
df_1.apply(setFreqByTypeRegion, axis=1)
df_1.apply(compileDatePriceByTypeRegion, axis=1)
df_result = df_1
df_result[['Date', 'Price']] = pd.DataFrame(datePriceListByTypeRegion, columns=['Date', 'Price'])
df_result = df_result.explode(['Date', 'Price'], ignore_index=True)
print(f"df_result\n{df_result}")
输出:
df_1
Type Region State Date Price Type-Date
0 Product1 Northeast New York 25/03/2022 5.00 Friday (only)
1 Product1 Southeast Florida 25/03/2022 4.50 Tuesday, Friday (only)
2 Product2 Northwest Washington 25/03/2022 4.00 Friday (only)
3 Product3 Midwest Illinois 25/03/2022 4.00 Monday, Wednesday, Friday (only)
df_2
Type Region State Date Price Type-Date
0 Product1 Northeast New York 25/03/2022 5.00 Friday (only)
1 Product1 Northeast New York 04/03/2022 4.00 Friday (only)
2 Product1 Northeast New York 25/02/2022 4.00 Friday (only)
3 Product1 Southeast Florida 25/03/2022 4.50 Tuesday, Friday (only)
4 Product1 Southeast Florida 04/03/2022 4.25 Tuesday, Friday (only)
5 Product1 Southeast Florida 25/02/2022 4.10 Tuesday, Friday (only)
6 Product2 Northwest Washington 25/03/2022 4.00 Friday (only)
7 Product2 Northwest Washington 11/03/2022 3.00 Friday (only)
8 Product2 Northwest Washington 04/03/2022 2.00 Friday (only)
9 Product3 Midwest Illinois 25/03/2022 4.00 Monday, Wednesday, Friday (only)
10 Product3 Midwest Illinois 16/03/2022 3.00 Monday, Wednesday, Friday (only)
11 Product3 Midwest Illinois 14/03/2022 4.00 Monday, Wednesday, Friday (only)
df_result
Type Region State Date Price Type-Date
0 Product1 Northeast New York 25/03/2022 5.00 Friday (only)
1 Product1 Northeast New York 18/03/2022 NaN Friday (only)
2 Product1 Northeast New York 11/03/2022 NaN Friday (only)
3 Product1 Southeast Florida 25/03/2022 4.50 Tuesday, Friday (only)
4 Product1 Southeast Florida 22/03/2022 NaN Tuesday, Friday (only)
5 Product1 Southeast Florida 18/03/2022 NaN Tuesday, Friday (only)
6 Product1 Southeast Florida 15/03/2022 NaN Tuesday, Friday (only)
7 Product1 Southeast Florida 11/03/2022 NaN Tuesday, Friday (only)
8 Product1 Southeast Florida 08/03/2022 NaN Tuesday, Friday (only)
9 Product2 Northwest Washington 25/03/2022 4.00 Friday (only)
10 Product2 Northwest Washington 18/03/2022 NaN Friday (only)
11 Product3 Midwest Illinois 25/03/2022 4.00 Monday, Wednesday, Friday (only)
12 Product3 Midwest Illinois 23/03/2022 NaN Monday, Wednesday, Friday (only)
13 Product3 Midwest Illinois 21/03/2022 NaN Monday, Wednesday, Friday (only)
14 Product3 Midwest Illinois 18/03/2022 NaN Monday, Wednesday, Friday (only)
在这个问题中,你有两个数据框,一个是最后发布的价格,通常是今天。在另一个数据框中,我们有所有的发布。
我们的想法是,我们可以使用这两个数据框,结果是当前价格和最近第二天的价格之间的差异。重复今天并忽略倒数第二个日期。最难的部分是这种差异需要遵循周期性模式。所以如果日期类型是星期五,则只能与之前的星期五相差。
以重复行的方式,除了不可用的价格。
第一个数据帧:
import pandas as pd
data = {
'Type': ['Product1', 'Product2', 'Product3'],
'State': ['New York', 'Washington', 'Illinois'],
'Date':['25/03/2022','25/03/2022','25/03/2022'],
'Price':['5.00','4.00','4.00'],
'Type-Date':['Friday (only)','Friday (only)','Monday, Wednesday, Friday (only)']}
df_1 = pd.DataFrame(data)
df_1
Type State Date Price Name-Date
0 Product1 New York 25/03/2022 5.00 Friday (only)
1 Product2 Washington 25/03/2022 4.00 Friday (only)
2 Product3 Illinois 25/03/2022 4.00 Monday, Wednesday, Friday (only)
第二个数据帧:
data = {'Type': ['Product1', 'Product1', 'Product1','Product2','Product2','Product2','Product3','Product3','Product3'],
'State': ['New York', 'New York','New York', 'Washington', 'Washington', 'Washington', 'Illinois', 'Illinois', 'Illinois'],
'Date':['25/03/2022','04/03/2022','25/02/2022', '25/03/2022', '11/03/2022', '04/03/2022', '25/03/2022', '16/03/2022', '14/03/2022'],
'Price':['5.00','4.00','4.00','4.00','3.00','2.00','4.00','3.00','4.00'],
'Type-Date':['Friday (only)','Friday (only)','Friday (only)','Friday (only)','Friday (only)','Friday (only)',
'Monday, Wednesday, Friday (only)','Monday, Wednesday, Friday (only)','Monday, Wednesday, Friday (only)']}
df_2 = pd.DataFrame(data)
df_2
Type State Date Price Type-Date
0 Product1 New York 25/03/2022 5.00 Friday (only)
1 Product1 New York 04/03/2022 4.00 Friday (only)
2 Product1 New York 25/02/2022 4.00 Friday (only)
3 Product2 Washington 25/03/2022 4.00 Friday (only)
4 Product2 Washington 11/03/2022 3.00 Friday (only)
5 Product2 Washington 04/03/2022 2.00 Friday (only)
6 Product3 Illinois 25/03/2022 4.00 Monday, Wednesday, Friday (only)
7 Product3 Illinois 16/03/2022 3.00 Monday, Wednesday, Friday (only)
8 Product3 Illinois 14/03/2022 4.00 Monday, Wednesday, Friday (only)
想要的结果
Type State Date Price Type-Date
0 Product1 New York 25/03/2022 5.00 Friday (only)
1 Product1 New York 18/03/2022 NaN Friday (only)
2 Product1 New York 11/03/2022 NaN Friday (only)
3 Product2 Washington 25/03/2022 4.00 Friday (only)
4 Product2 Washington 18/03/2022 NaN Friday (only)
5 Product3 Illinois 25/03/2022 4.00 Monday, Wednesday, Friday (only)
6 Product3 Illinois 23/03/2022 NaN Monday, Wednesday, Friday (only)
7 Product3 Illinois 21/03/2022 NaN Monday, Wednesday, Friday (only)
8 Product3 Illinois 18/03/2022 NaN Monday, Wednesday, Friday (only)
这里有很多,这也意味着可能会出现几种可能的情况,这些情况可能会或可能不会在此答案中预料到。例如,如果在 df_2 中找不到给定类型的 df_1 中的日期,或者给定类型在 df_2 中没有条目,等等
考虑到这一点,这里有一些代码可以产生问题中指定的所需结果:
import pandas as pd
import numpy as np
data = {
'Type': ['Product1', 'Product2', 'Product3'],
'State': ['New York', 'Washington', 'Illinois'],
'Date':['25/03/2022','25/03/2022','25/03/2022'],
'Price':['5.00','4.00','4.00'],
'Type-Date':['Friday (only)','Friday (only)','Monday, Wednesday, Friday (only)']}
df_1 = pd.DataFrame(data)
data = {'Type': ['Product1', 'Product1', 'Product1','Product2','Product2','Product2','Product3','Product3','Product3'],
'State': ['New York', 'New York','New York', 'Washington', 'Washington', 'Washington', 'Illinois', 'Illinois', 'Illinois'],
'Date':['25/03/2022','04/03/2022','25/02/2022', '25/03/2022', '11/03/2022', '04/03/2022', '25/03/2022', '16/03/2022', '14/03/2022'],
'Price':['5.00','4.00','4.00','4.00','3.00','2.00','4.00','3.00','4.00'],
'Type-Date':['Friday (only)','Friday (only)','Friday (only)','Friday (only)','Friday (only)','Friday (only)',
'Monday, Wednesday, Friday (only)','Monday, Wednesday, Friday (only)','Monday, Wednesday, Friday (only)']}
df_2 = pd.DataFrame(data)
'''
Objective:
Create a dataframe which for each Type contains:
- today's Date and Price from df_1
- prior Date values with Price of NaN going back in time according to the Type's corresponding Type-Date value, back to but not including the penultimate date for which a price is available in df_2
'''
dayStrToInt = {'Monday':0,'Tuesday':1,'Wednesday':2,'Thursday':3,'Friday':4,'Saturday':5,'Sunday':6}
freqByType = {}
def setFreqByType(row):
weekdays = [s.strip() for s in row['Type-Date'].replace('(only)', '').split(',')]
if not weekdays:
raise ValueError(f"No weekdays found in Type-Date {repr(row['Type-Date'])}")
days = []
for w in weekdays:
if w not in dayStrToInt:
raise ValueError(f'Bad day-of-week string {w}')
days.append(dayStrToInt[w])
freqByType[row['Type']] = days
import datetime
datePriceListByType = []
def compileDatePriceByType(row):
curType = row['Type']
curDate = datetime.datetime.strptime(row['Date'], '%d/%m/%Y').date()
allDates = [datetime.datetime.strptime(dateStr, '%d/%m/%Y').date() for dateStr in df_2[df_2['Type']==row['Type']]['Date']]
allDateStrs = [dt.strftime('%d/%m/%Y') for dt in allDates]
minDate = min(allDates)
newDates = [curDate]
dt = curDate
days = freqByType[curType]
while dt > minDate:
curWD = dt.weekday()
nextWD = curWD
while nextWD not in days:
nextWD = (nextWD - 1) % 7
iWD = (days.index(nextWD) - (1 if nextWD == curWD else 0)) % len(days)
dt -= datetime.timedelta(days=(curWD - days[iWD]) % 7 if curWD != days[iWD] else 7)
if dt in allDates:
break
if dt > minDate:
newDates.append(dt)
datePrice = [[dt.strftime('%d/%m/%Y') for dt in newDates], [row['Price']] + [np.nan]*(len(newDates) - 1)]
datePriceListByType.append(datePrice)
df_1.apply(setFreqByType, axis=1)
df_1.apply(compileDatePriceByType, axis=1)
df_result = df_1
df_result[['Date', 'Price']] = pd.DataFrame(datePriceListByType, columns=['Date', 'Price'])
df_result = df_result.explode(['Date', 'Price'], ignore_index=True)
print(df_result)
输出:
Type State Date Price Type-Date
0 Product1 New York 25/03/2022 5.00 Friday (only)
1 Product1 New York 18/03/2022 NaN Friday (only)
2 Product1 New York 11/03/2022 NaN Friday (only)
3 Product2 Washington 25/03/2022 4.00 Friday (only)
4 Product2 Washington 18/03/2022 NaN Friday (only)
5 Product3 Illinois 25/03/2022 4.00 Monday, Wednesday, Friday (only)
6 Product3 Illinois 23/03/2022 NaN Monday, Wednesday, Friday (only)
7 Product3 Illinois 21/03/2022 NaN Monday, Wednesday, Friday (only)
8 Product3 Illinois 18/03/2022 NaN Monday, Wednesday, Friday (only)
更新: 键是(类型,区域)而不只是类型。
如果需要 Type-Date(即每周计划)根据 multi-column 键(例如(类型,地区))而变化,也可以实现。虽然可以根据键列列表对此进行概括,但我将仅分享一个硬编码类型和区域两列的示例:
import pandas as pd
import numpy as np
data = {
'Type': ['Product1', 'Product1', 'Product2', 'Product3'],
'Region': ['Northeast', 'Southeast', 'Northwest', 'Midwest'],
'State': ['New York', 'Florida', 'Washington', 'Illinois'],
'Date':['25/03/2022','25/03/2022','25/03/2022','25/03/2022'],
'Price':['5.00','4.50','4.00','4.00'],
'Type-Date':['Friday (only)','Tuesday, Friday (only)','Friday (only)','Monday, Wednesday, Friday (only)']}
df_1 = pd.DataFrame(data)
print(f"df_1\n{df_1}")
data = {'Type': ['Product1', 'Product1', 'Product1','Product1', 'Product1', 'Product1','Product2','Product2','Product2','Product3','Product3','Product3'],
'Region': ['Northeast', 'Northeast', 'Northeast', 'Southeast', 'Southeast', 'Southeast', 'Northwest', 'Northwest', 'Northwest', 'Midwest', 'Midwest', 'Midwest'],
'State': ['New York', 'New York','New York', 'Florida', 'Florida', 'Florida', 'Washington', 'Washington', 'Washington', 'Illinois', 'Illinois', 'Illinois'],
'Date':['25/03/2022','04/03/2022','25/02/2022', '25/03/2022','04/03/2022','25/02/2022', '25/03/2022', '11/03/2022', '04/03/2022', '25/03/2022', '16/03/2022', '14/03/2022'],
'Price':['5.00','4.00','4.00','4.50','4.25','4.10','4.00','3.00','2.00','4.00','3.00','4.00'],
'Type-Date':['Friday (only)','Friday (only)','Friday (only)','Tuesday, Friday (only)','Tuesday, Friday (only)','Tuesday, Friday (only)','Friday (only)','Friday (only)','Friday (only)',
'Monday, Wednesday, Friday (only)','Monday, Wednesday, Friday (only)','Monday, Wednesday, Friday (only)']}
df_2 = pd.DataFrame(data)
print(f"df_2\n{df_2}")
'''
Objective:
Create a dataframe which for each (Type, Region) pair contains:
- today's Date and Price from df_1
- prior Date values with Price of NaN going back in time according to the (Type, Region) pair's corresponding Type-Date value, back to but not including the penultimate date for which a price is available in df_2
'''
dayStrToInt = {'Monday':0,'Tuesday':1,'Wednesday':2,'Thursday':3,'Friday':4,'Saturday':5,'Sunday':6}
freqByTypeRegion = {}
def setFreqByTypeRegion(row):
weekdays = [s.strip() for s in row['Type-Date'].replace('(only)', '').split(',')]
if not weekdays:
raise ValueError(f"No weekdays found in Type-Date {repr(row['Type-Date'])}")
days = []
for w in weekdays:
if w not in dayStrToInt:
raise ValueError(f'Bad day-of-week string {w}')
days.append(dayStrToInt[w])
freqByTypeRegion[(row['Type'], row['Region'])] = days
import datetime
datePriceListByTypeRegion = []
def compileDatePriceByTypeRegion(row):
curTypeRegion = (row['Type'], row['Region'])
curDate = datetime.datetime.strptime(row['Date'], '%d/%m/%Y').date()
allDates = [datetime.datetime.strptime(dateStr, '%d/%m/%Y').date() for dateStr in df_2[(df_2['Type']==row['Type']) & (df_2['Region']==row['Region'])]['Date']]
allDateStrs = [dt.strftime('%d/%m/%Y') for dt in allDates]
minDate = min(allDates)
newDates = [curDate]
dt = curDate
days = freqByTypeRegion[curTypeRegion]
while dt > minDate:
curWD = dt.weekday()
nextWD = curWD
while nextWD not in days:
nextWD = (nextWD - 1) % 7
iWD = (days.index(nextWD) - (1 if nextWD == curWD else 0)) % len(days)
dt -= datetime.timedelta(days=(curWD - days[iWD]) % 7 if curWD != days[iWD] else 7)
if dt in allDates:
break
if dt > minDate:
newDates.append(dt)
datePrice = [[dt.strftime('%d/%m/%Y') for dt in newDates], [row['Price']] + [np.nan]*(len(newDates) - 1)]
datePriceListByTypeRegion.append(datePrice)
df_1.apply(setFreqByTypeRegion, axis=1)
df_1.apply(compileDatePriceByTypeRegion, axis=1)
df_result = df_1
df_result[['Date', 'Price']] = pd.DataFrame(datePriceListByTypeRegion, columns=['Date', 'Price'])
df_result = df_result.explode(['Date', 'Price'], ignore_index=True)
print(f"df_result\n{df_result}")
输出:
df_1
Type Region State Date Price Type-Date
0 Product1 Northeast New York 25/03/2022 5.00 Friday (only)
1 Product1 Southeast Florida 25/03/2022 4.50 Tuesday, Friday (only)
2 Product2 Northwest Washington 25/03/2022 4.00 Friday (only)
3 Product3 Midwest Illinois 25/03/2022 4.00 Monday, Wednesday, Friday (only)
df_2
Type Region State Date Price Type-Date
0 Product1 Northeast New York 25/03/2022 5.00 Friday (only)
1 Product1 Northeast New York 04/03/2022 4.00 Friday (only)
2 Product1 Northeast New York 25/02/2022 4.00 Friday (only)
3 Product1 Southeast Florida 25/03/2022 4.50 Tuesday, Friday (only)
4 Product1 Southeast Florida 04/03/2022 4.25 Tuesday, Friday (only)
5 Product1 Southeast Florida 25/02/2022 4.10 Tuesday, Friday (only)
6 Product2 Northwest Washington 25/03/2022 4.00 Friday (only)
7 Product2 Northwest Washington 11/03/2022 3.00 Friday (only)
8 Product2 Northwest Washington 04/03/2022 2.00 Friday (only)
9 Product3 Midwest Illinois 25/03/2022 4.00 Monday, Wednesday, Friday (only)
10 Product3 Midwest Illinois 16/03/2022 3.00 Monday, Wednesday, Friday (only)
11 Product3 Midwest Illinois 14/03/2022 4.00 Monday, Wednesday, Friday (only)
df_result
Type Region State Date Price Type-Date
0 Product1 Northeast New York 25/03/2022 5.00 Friday (only)
1 Product1 Northeast New York 18/03/2022 NaN Friday (only)
2 Product1 Northeast New York 11/03/2022 NaN Friday (only)
3 Product1 Southeast Florida 25/03/2022 4.50 Tuesday, Friday (only)
4 Product1 Southeast Florida 22/03/2022 NaN Tuesday, Friday (only)
5 Product1 Southeast Florida 18/03/2022 NaN Tuesday, Friday (only)
6 Product1 Southeast Florida 15/03/2022 NaN Tuesday, Friday (only)
7 Product1 Southeast Florida 11/03/2022 NaN Tuesday, Friday (only)
8 Product1 Southeast Florida 08/03/2022 NaN Tuesday, Friday (only)
9 Product2 Northwest Washington 25/03/2022 4.00 Friday (only)
10 Product2 Northwest Washington 18/03/2022 NaN Friday (only)
11 Product3 Midwest Illinois 25/03/2022 4.00 Monday, Wednesday, Friday (only)
12 Product3 Midwest Illinois 23/03/2022 NaN Monday, Wednesday, Friday (only)
13 Product3 Midwest Illinois 21/03/2022 NaN Monday, Wednesday, Friday (only)
14 Product3 Midwest Illinois 18/03/2022 NaN Monday, Wednesday, Friday (only)