What is the best way to solve this problem between dates and dataframes?

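In this problem there are two dataframes. One holds the last published price for each product, usually today's. The other holds every published price.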

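The idea is to use these two dataframes to fill in the gap between the current price date and the second most recent price date: today's row is repeated for each date in between, and the penultimate date itself is left out. The hardest part is that this gap has to follow a periodic pattern. So if the Type-Date is Friday, the dates can only step back to previous Fridays.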

The rows are repeated as they are, except for the price, which is not available for those dates.

First dataframe:

import pandas as pd

data = {
'Type': ['Product1', 'Product2', 'Product3'], 
'State': ['New York', 'Washington', 'Illinois'], 
'Date':['25/03/2022','25/03/2022','25/03/2022'], 
'Price':['5.00','4.00','4.00'], 
'Type-Date':['Friday (only)','Friday (only)','Monday, Wednesday, Friday (only)']}

df_1 = pd.DataFrame(data)
df_1

    Type     State      Date        Price   Type-Date
0   Product1 New York   25/03/2022  5.00    Friday (only)
1   Product2 Washington 25/03/2022  4.00    Friday (only)
2   Product3 Illinois   25/03/2022  4.00    Monday, Wednesday, Friday (only)

Second dataframe:

data = {'Type': ['Product1', 'Product1', 'Product1','Product2','Product2','Product2','Product3','Product3','Product3'], 
'State': ['New York', 'New York','New York', 'Washington', 'Washington', 'Washington', 'Illinois', 'Illinois', 'Illinois'], 
'Date':['25/03/2022','04/03/2022','25/02/2022', '25/03/2022', '11/03/2022', '04/03/2022', '25/03/2022', '16/03/2022', '14/03/2022'], 
'Price':['5.00','4.00','4.00','4.00','3.00','2.00','4.00','3.00','4.00'], 
'Type-Date':['Friday (only)','Friday (only)','Friday (only)','Friday (only)','Friday (only)','Friday (only)',
'Monday, Wednesday, Friday (only)','Monday, Wednesday, Friday (only)','Monday, Wednesday, Friday (only)']}

df_2 = pd.DataFrame(data)
df_2

    Type     State      Date        Price   Type-Date
0   Product1 New York   25/03/2022  5.00    Friday (only)
1   Product1 New York   04/03/2022  4.00    Friday (only)
2   Product1 New York   25/02/2022  4.00    Friday (only)
3   Product2 Washington 25/03/2022  4.00    Friday (only)
4   Product2 Washington 11/03/2022  3.00    Friday (only)
5   Product2 Washington 04/03/2022  2.00    Friday (only)
6   Product3 Illinois   25/03/2022  4.00    Monday, Wednesday, Friday (only)
7   Product3 Illinois   16/03/2022  3.00    Monday, Wednesday, Friday (only)
8   Product3 Illinois   14/03/2022  4.00    Monday, Wednesday, Friday (only)

Desired result:

    Type     State      Date       Price  Type-Date
0   Product1 New York   25/03/2022 5.00   Friday (only)
1   Product1 New York   18/03/2022 NaN    Friday (only)
2   Product1 New York   11/03/2022 NaN    Friday (only)
3   Product2 Washington 25/03/2022 4.00   Friday (only)
4   Product2 Washington 18/03/2022 NaN    Friday (only)
5   Product3 Illinois   25/03/2022 4.00   Monday, Wednesday, Friday (only)
6   Product3 Illinois   23/03/2022 NaN    Monday, Wednesday, Friday (only)
7   Product3 Illinois   21/03/2022 NaN    Monday, Wednesday, Friday (only)
8   Product3 Illinois   18/03/2022 NaN    Monday, Wednesday, Friday (only)

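There's a lot going on here, which also means several cases can come up that this answer may or may not anticipate. For example, the date in df_1 for a given Type might not be found in df_2, or a given Type might have no entries in df_2 at all, and so on.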
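As a rough sketch of how the two cases just mentioned could be detected before doing any date arithmetic, something like the following might help. The helper name `find_key_problems` and its `key_cols` parameter are made up for illustration and are not part of the code in this answer. It only reports problems; what to do about them is left open.

```python
import pandas as pd

def find_key_problems(df_latest, df_history, key_cols=("Type",)):
    """Return a list of problem descriptions (empty if none were found).

    Hypothetical helper: checks, for each row of df_latest, that df_history
    has rows for the same key and that the row's Date appears among them.
    """
    problems = []
    for _, row in df_latest.iterrows():
        # Build a boolean mask selecting the history rows with the same key
        mask = pd.Series(True, index=df_history.index)
        for col in key_cols:
            mask &= df_history[col] == row[col]
        key = tuple(row[col] for col in key_cols)
        dates = df_history.loc[mask, "Date"]
        if dates.empty:
            problems.append(f"{key}: no rows in history")
        elif row["Date"] not in set(dates):
            problems.append(f"{key}: date {row['Date']} not in history")
    return problems
```

Calling `find_key_problems(df_1, df_2)` on the frames from the question should return an empty list. The first case matters in practice because `min()` raises a `ValueError` when it is given an empty list of dates.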

With that in mind, here is some code that produces the desired result specified in the question:

import pandas as pd
import numpy as np

data = {
'Type': ['Product1', 'Product2', 'Product3'], 
'State': ['New York', 'Washington', 'Illinois'], 
'Date':['25/03/2022','25/03/2022','25/03/2022'], 
'Price':['5.00','4.00','4.00'], 
'Type-Date':['Friday (only)','Friday (only)','Monday, Wednesday, Friday (only)']}

df_1 = pd.DataFrame(data)

data = {'Type': ['Product1', 'Product1', 'Product1','Product2','Product2','Product2','Product3','Product3','Product3'], 
'State': ['New York', 'New York','New York', 'Washington', 'Washington', 'Washington', 'Illinois', 'Illinois', 'Illinois'], 
'Date':['25/03/2022','04/03/2022','25/02/2022', '25/03/2022', '11/03/2022', '04/03/2022', '25/03/2022', '16/03/2022', '14/03/2022'], 
'Price':['5.00','4.00','4.00','4.00','3.00','2.00','4.00','3.00','4.00'], 
'Type-Date':['Friday (only)','Friday (only)','Friday (only)','Friday (only)','Friday (only)','Friday (only)',
'Monday, Wednesday, Friday (only)','Monday, Wednesday, Friday (only)','Monday, Wednesday, Friday (only)']}

df_2 = pd.DataFrame(data)

'''
Objective:
Create a dataframe which for each Type contains:
- today's Date and Price from df_1
- prior Date values with Price of NaN going back in time according to the Type's corresponding Type-Date value, back to but not including the penultimate date for which a price is available in df_2
'''

dayStrToInt = {'Monday':0,'Tuesday':1,'Wednesday':2,'Thursday':3,'Friday':4,'Saturday':5,'Sunday':6}
freqByType = {}
def setFreqByType(row):
    weekdays = [s.strip() for s in row['Type-Date'].replace('(only)', '').split(',')]
    if not weekdays:
        raise ValueError(f"No weekdays found in Type-Date {repr(row['Type-Date'])}")
    days = []
    for w in weekdays:
        if w not in dayStrToInt:
            raise ValueError(f'Bad day-of-week string {w}')
        days.append(dayStrToInt[w])
    freqByType[row['Type']] = days
import datetime
datePriceListByType = []
def compileDatePriceByType(row):
    curType = row['Type']
    curDate = datetime.datetime.strptime(row['Date'], '%d/%m/%Y').date()
    # All dates in df_2 on which this Type has a price
    allDates = [datetime.datetime.strptime(dateStr, '%d/%m/%Y').date() for dateStr in df_2[df_2['Type']==row['Type']]['Date']]
    minDate = min(allDates)
    newDates = [curDate]
    dt = curDate
    days = freqByType[curType]
    while dt > minDate:
        curWD = dt.weekday()
        nextWD = curWD
        while nextWD not in days:
            nextWD = (nextWD - 1) % 7
        iWD = (days.index(nextWD) - (1 if nextWD == curWD else 0)) % len(days)
        dt -= datetime.timedelta(days=(curWD - days[iWD]) % 7 if curWD != days[iWD] else 7)
        if dt in allDates:
            break
        if dt > minDate:
            newDates.append(dt)
    datePrice = [[dt.strftime('%d/%m/%Y') for dt in newDates], [row['Price']] + [np.nan]*(len(newDates) - 1)]
    datePriceListByType.append(datePrice)

df_1.apply(setFreqByType, axis=1)
df_1.apply(compileDatePriceByType, axis=1)
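df_result = df_1.copy()  # copy so that df_1 itself is left unchanged
df_result[['Date', 'Price']] = pd.DataFrame(datePriceListByType, columns=['Date', 'Price'])
# Exploding several columns at once requires pandas 1.3 or later
df_result = df_result.explode(['Date', 'Price'], ignore_index=True)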
print(df_result)

Output:

       Type       State        Date Price                         Type-Date
0  Product1    New York  25/03/2022  5.00                     Friday (only)
1  Product1    New York  18/03/2022   NaN                     Friday (only)
2  Product1    New York  11/03/2022   NaN                     Friday (only)
3  Product2  Washington  25/03/2022  4.00                     Friday (only)
4  Product2  Washington  18/03/2022   NaN                     Friday (only)
5  Product3    Illinois  25/03/2022  4.00  Monday, Wednesday, Friday (only)
6  Product3    Illinois  23/03/2022   NaN  Monday, Wednesday, Friday (only)
7  Product3    Illinois  21/03/2022   NaN  Monday, Wednesday, Friday (only)
8  Product3    Illinois  18/03/2022   NaN  Monday, Wednesday, Friday (only)
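For comparison, the backward walk over scheduled weekdays can probably also be expressed with `pd.bdate_range` and a custom `weekmask`. The sketch below is an alternative technique, not the code above rewritten. The names `weekmask_from_type_date` and `scheduled_dates_back` are invented for illustration, and it assumes the most recent earlier date in df_2 falls on a scheduled weekday. When that does not hold, the loop above keeps walking past that date, while this sketch stops at it.

```python
import pandas as pd

DAY_ABBR = {"Monday": "Mon", "Tuesday": "Tue", "Wednesday": "Wed",
            "Thursday": "Thu", "Friday": "Fri", "Saturday": "Sat",
            "Sunday": "Sun"}

def weekmask_from_type_date(type_date):
    """Turn 'Monday, Wednesday, Friday (only)' into 'Mon Wed Fri'."""
    names = [s.strip() for s in type_date.replace("(only)", "").split(",")]
    return " ".join(DAY_ABBR[n] for n in names)  # KeyError on an unknown name

def scheduled_dates_back(today, known_dates, type_date):
    """Today, then scheduled dates going back to (but excluding) the most
    recent earlier date in known_dates. All dates are pandas Timestamps."""
    earlier = [d for d in known_dates if d < today]
    if not earlier:
        return [today]
    stop = max(earlier)
    # freq="C" is the custom business day frequency that honours weekmask
    rng = pd.bdate_range(start=stop, end=today, freq="C",
                         weekmask=weekmask_from_type_date(type_date))
    return [today] + [d for d in rng[::-1] if stop < d < today]
```

With Product3's dates from df_2 parsed via `pd.to_datetime(..., format="%d/%m/%Y")`, this gives 25/03, 23/03, 21/03 and 18/03, the same dates as in the output above.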

UPDATE: the key is (Type, Region) rather than just Type.

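If Type-Date (i.e., the weekly schedule) needs to vary by a multi-column key such as (Type, Region), that can be done as well. Although this could be generalized to take a list of key columns, I will just share an example that hardcodes the two columns Type and Region: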

import pandas as pd
import numpy as np

data = {
'Type': ['Product1', 'Product1', 'Product2', 'Product3'], 
'Region': ['Northeast', 'Southeast', 'Northwest', 'Midwest'], 
'State': ['New York', 'Florida', 'Washington', 'Illinois'], 
'Date':['25/03/2022','25/03/2022','25/03/2022','25/03/2022'], 
'Price':['5.00','4.50','4.00','4.00'], 
'Type-Date':['Friday (only)','Tuesday, Friday (only)','Friday (only)','Monday, Wednesday, Friday (only)']}

df_1 = pd.DataFrame(data)
print(f"df_1\n{df_1}")

data = {'Type': ['Product1', 'Product1', 'Product1','Product1', 'Product1', 'Product1','Product2','Product2','Product2','Product3','Product3','Product3'], 
'Region': ['Northeast', 'Northeast', 'Northeast', 'Southeast', 'Southeast', 'Southeast', 'Northwest', 'Northwest', 'Northwest', 'Midwest', 'Midwest', 'Midwest'], 
'State': ['New York', 'New York','New York', 'Florida', 'Florida', 'Florida', 'Washington', 'Washington', 'Washington', 'Illinois', 'Illinois', 'Illinois'], 
'Date':['25/03/2022','04/03/2022','25/02/2022', '25/03/2022','04/03/2022','25/02/2022', '25/03/2022', '11/03/2022', '04/03/2022', '25/03/2022', '16/03/2022', '14/03/2022'], 
'Price':['5.00','4.00','4.00','4.50','4.25','4.10','4.00','3.00','2.00','4.00','3.00','4.00'], 
'Type-Date':['Friday (only)','Friday (only)','Friday (only)','Tuesday, Friday (only)','Tuesday, Friday (only)','Tuesday, Friday (only)','Friday (only)','Friday (only)','Friday (only)',
'Monday, Wednesday, Friday (only)','Monday, Wednesday, Friday (only)','Monday, Wednesday, Friday (only)']}

df_2 = pd.DataFrame(data)
print(f"df_2\n{df_2}")

'''
Objective:
Create a dataframe which for each (Type, Region) pair contains:
- today's Date and Price from df_1
- prior Date values with Price of NaN going back in time according to the (Type, Region) pair's corresponding Type-Date value, back to but not including the penultimate date for which a price is available in df_2
'''

dayStrToInt = {'Monday':0,'Tuesday':1,'Wednesday':2,'Thursday':3,'Friday':4,'Saturday':5,'Sunday':6}
freqByTypeRegion = {}
def setFreqByTypeRegion(row):
    weekdays = [s.strip() for s in row['Type-Date'].replace('(only)', '').split(',')]
    if not weekdays:
        raise ValueError(f"No weekdays found in Type-Date {repr(row['Type-Date'])}")
    days = []
    for w in weekdays:
        if w not in dayStrToInt:
            raise ValueError(f'Bad day-of-week string {w}')
        days.append(dayStrToInt[w])
    freqByTypeRegion[(row['Type'], row['Region'])] = days
import datetime
datePriceListByTypeRegion = []
def compileDatePriceByTypeRegion(row):
    curTypeRegion = (row['Type'], row['Region'])
    curDate = datetime.datetime.strptime(row['Date'], '%d/%m/%Y').date()
    # All dates in df_2 on which this (Type, Region) pair has a price
    allDates = [datetime.datetime.strptime(dateStr, '%d/%m/%Y').date() for dateStr in df_2[(df_2['Type']==row['Type']) & (df_2['Region']==row['Region'])]['Date']]
    minDate = min(allDates)
    newDates = [curDate]
    dt = curDate
    days = freqByTypeRegion[curTypeRegion]
    while dt > minDate:
        curWD = dt.weekday()
        nextWD = curWD
        while nextWD not in days:
            nextWD = (nextWD - 1) % 7
        iWD = (days.index(nextWD) - (1 if nextWD == curWD else 0)) % len(days)
        dt -= datetime.timedelta(days=(curWD - days[iWD]) % 7 if curWD != days[iWD] else 7)
        if dt in allDates:
            break
        if dt > minDate:
            newDates.append(dt)
    datePrice = [[dt.strftime('%d/%m/%Y') for dt in newDates], [row['Price']] + [np.nan]*(len(newDates) - 1)]
    datePriceListByTypeRegion.append(datePrice)

df_1.apply(setFreqByTypeRegion, axis=1)
df_1.apply(compileDatePriceByTypeRegion, axis=1)
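df_result = df_1.copy()  # copy so that df_1 itself is left unchanged
df_result[['Date', 'Price']] = pd.DataFrame(datePriceListByTypeRegion, columns=['Date', 'Price'])
# Exploding several columns at once requires pandas 1.3 or later
df_result = df_result.explode(['Date', 'Price'], ignore_index=True)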
print(f"df_result\n{df_result}")

Output:

df_1
       Type     Region       State        Date Price                         Type-Date
0  Product1  Northeast    New York  25/03/2022  5.00                     Friday (only)
1  Product1  Southeast     Florida  25/03/2022  4.50            Tuesday, Friday (only)
2  Product2  Northwest  Washington  25/03/2022  4.00                     Friday (only)
3  Product3    Midwest    Illinois  25/03/2022  4.00  Monday, Wednesday, Friday (only)
df_2
        Type     Region       State        Date Price                         Type-Date
0   Product1  Northeast    New York  25/03/2022  5.00                     Friday (only)
1   Product1  Northeast    New York  04/03/2022  4.00                     Friday (only)
2   Product1  Northeast    New York  25/02/2022  4.00                     Friday (only)
3   Product1  Southeast     Florida  25/03/2022  4.50            Tuesday, Friday (only)
4   Product1  Southeast     Florida  04/03/2022  4.25            Tuesday, Friday (only)
5   Product1  Southeast     Florida  25/02/2022  4.10            Tuesday, Friday (only)
6   Product2  Northwest  Washington  25/03/2022  4.00                     Friday (only)
7   Product2  Northwest  Washington  11/03/2022  3.00                     Friday (only)
8   Product2  Northwest  Washington  04/03/2022  2.00                     Friday (only)
9   Product3    Midwest    Illinois  25/03/2022  4.00  Monday, Wednesday, Friday (only)
10  Product3    Midwest    Illinois  16/03/2022  3.00  Monday, Wednesday, Friday (only)
11  Product3    Midwest    Illinois  14/03/2022  4.00  Monday, Wednesday, Friday (only)
df_result
        Type     Region       State        Date Price                         Type-Date
0   Product1  Northeast    New York  25/03/2022  5.00                     Friday (only)
1   Product1  Northeast    New York  18/03/2022   NaN                     Friday (only)
2   Product1  Northeast    New York  11/03/2022   NaN                     Friday (only)
3   Product1  Southeast     Florida  25/03/2022  4.50            Tuesday, Friday (only)
4   Product1  Southeast     Florida  22/03/2022   NaN            Tuesday, Friday (only)
5   Product1  Southeast     Florida  18/03/2022   NaN            Tuesday, Friday (only)
6   Product1  Southeast     Florida  15/03/2022   NaN            Tuesday, Friday (only)
7   Product1  Southeast     Florida  11/03/2022   NaN            Tuesday, Friday (only)
8   Product1  Southeast     Florida  08/03/2022   NaN            Tuesday, Friday (only)
9   Product2  Northwest  Washington  25/03/2022  4.00                     Friday (only)
10  Product2  Northwest  Washington  18/03/2022   NaN                     Friday (only)
11  Product3    Midwest    Illinois  25/03/2022  4.00  Monday, Wednesday, Friday (only)
12  Product3    Midwest    Illinois  23/03/2022   NaN  Monday, Wednesday, Friday (only)
13  Product3    Midwest    Illinois  21/03/2022   NaN  Monday, Wednesday, Friday (only)
14  Product3    Midwest    Illinois  18/03/2022   NaN  Monday, Wednesday, Friday (only)
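The generalization to a list of key columns mentioned earlier could look roughly like the sketch below. The function names `parse_weekdays` and `expand_by_keys` and the `key_cols` parameter are hypothetical. It also swaps the index arithmetic for a simpler day-by-day walk, which is intended to select the same dates but is a different technique from the loop above.

```python
import datetime
import numpy as np
import pandas as pd

DAY_TO_INT = {"Monday": 0, "Tuesday": 1, "Wednesday": 2, "Thursday": 3,
              "Friday": 4, "Saturday": 5, "Sunday": 6}
DATE_FMT = "%d/%m/%Y"

def parse_weekdays(type_date):
    """'Tuesday, Friday (only)' -> {1, 4}; KeyError on an unknown day name."""
    names = [s.strip() for s in type_date.replace("(only)", "").split(",")]
    return {DAY_TO_INT[n] for n in names}

def expand_by_keys(df_latest, df_history, key_cols):
    """For each row of df_latest, emit the row itself plus NaN-priced rows for
    earlier scheduled dates, stopping before the next date found in df_history
    for the same key."""
    rows = []
    for _, row in df_latest.iterrows():
        # Select the history rows whose key columns all match this row
        mask = np.ones(len(df_history), dtype=bool)
        for col in key_cols:
            mask &= (df_history[col] == row[col]).to_numpy()
        known = {datetime.datetime.strptime(s, DATE_FMT).date()
                 for s in df_history.loc[mask, "Date"]}
        if not known:
            raise ValueError(f"No history rows for key {[row[c] for c in key_cols]}")
        weekdays = parse_weekdays(row["Type-Date"])
        cur = datetime.datetime.strptime(row["Date"], DATE_FMT).date()
        first = min(known)
        new_dates = [cur]
        d = cur - datetime.timedelta(days=1)
        while d > first:
            if d.weekday() in weekdays:  # only scheduled days are considered
                if d in known:           # a price exists here: stop
                    break
                new_dates.append(d)
            d -= datetime.timedelta(days=1)
        for i, nd in enumerate(new_dates):
            new_row = row.to_dict()
            new_row["Date"] = nd.strftime(DATE_FMT)
            new_row["Price"] = row["Price"] if i == 0 else np.nan
            rows.append(new_row)
    return pd.DataFrame(rows, columns=list(df_latest.columns))
```

Calling `expand_by_keys(df_1, df_2, ['Type', 'Region'])` on the frames defined above, before they are modified, should give the same rows as df_result, and `['Type']` alone would correspond to the first version of the code.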