pandas: 获取日期列之间的范围
pandas: get range between date columns
我有 pandas 数据框:
start_date finish_date progress_id
0 2018-06-23 08:28:50.681065+00 2018-06-23 08:28:52.439542+00 a387ab916f402cb3fbfffd29f68fd0ce
1 2019-03-18 14:23:17.328374+00 2019-03-18 14:54:50.979612+00 3b9dce04f32da32763124602557f92a3
2 2019-07-09 09:18:46.19862+00 2019-07-11 08:03:09.222385+00 73e17a05355852fe65b785c82c37d1ad
3 2018-07-27 15:39:17.666629+00 2018-07-27 16:13:55.086871+00 cc3eb34ae49c719648352c4175daee88
4 2019-04-24 18:42:40.272854+00 2019-04-24 18:44:57.507857+00 04ace4fe130d90c801e24eea13ee808e
我将列转换为 datetime.date 因为我不需要 df:
中的时间
df['start_date'] = pd.to_datetime(df['start_date']).dt.date
df['finish_date'] = pd.to_datetime(df['finish_date']).dt.date
所以,如果 start_date
和 finish_date
有相同的月份,我需要一个包含年月的新列。如果它们之间的放置范围不同。例如 start_date = 06-2020,finish_date = 08-2020 结果为 [06-2020, 07-2020, 08-2020]。然后我需要按列 explode
它。
我试过了:
df['range'] = df.apply(lambda x: pd.date_range(x['start_date'], x['finish_date'], freq="M"), axis=1)
df = df.explode('range')
但结果我在专栏中有很多 NaT。
任何解决方案都会很棒。
一种选择如下。假设您有以下数据框,df:
start_date finish_date \
0 2018-06-23 08:28:50.681065+00 2018-06-23 08:28:52.439542+00
1 2019-03-18 14:23:17.328374+00 2019-03-18 14:54:50.979612+00
2 2019-07-09 09:18:46.19862+00 2019-07-11 08:03:09.222385+00
3 2018-07-27 15:39:17.666629+00 2018-07-27 16:13:55.086871+00
4 2019-04-24 18:42:40.272854+00 2019-04-24 18:44:57.507857+00
5 2019-05-24 18:42:40.272854+00 2019-04-24 18:44:57.507857+00
progress_id
0 a387ab916f402cb3fbfffd29f68fd0ce
1 3b9dce04f32da32763124602557f92a3
2 73e17a05355852fe65b785c82c37d1ad
3 cc3eb34ae49c719648352c4175daee88
4 04ace4fe130d90c801e24eea13ee808e
5 04ace4fe130d90c801e24eea13ee808e
除了日期(年份和月份)不同的一行外,与您共享的相同。
然后应用这个:
df['start_date'] = pd.to_datetime(df['start_date'],format='%Y-%m-%d')
df['finish_date'] = pd.to_datetime(df['finish_date'],format='%Y-%m-%d')
df['finish_M_Y'] = df['finish_date'].dt.strftime('%Y-%m')
df['Start_M_Y'] = df['start_date'].dt.strftime('%Y-%m')
def range(row):
if row['Start_M_Y'] == row['finish_M_Y']:
val = row['Start_M_Y']
elif row['Start_M_Y'] != row['finish_M_Y']:
val = pd.date_range(row['Start_M_Y'] , row['finish_M_Y'], freq='M')
else:
val = -1
return val
df['Range'] = df.apply(range, axis=1)
df.explode('Range').drop(['Start_M_Y', 'finish_M_Y'], axis=1)
给你
start_date finish_date \
0 2018-06-23 08:28:50.681065+00:00 2018-06-23 08:28:52.439542+00:00
1 2019-03-18 14:23:17.328374+00:00 2019-03-18 14:54:50.979612+00:00
2 2019-07-09 09:18:46.198620+00:00 2019-07-11 08:03:09.222385+00:00
3 2018-07-27 15:39:17.666629+00:00 2018-07-27 16:13:55.086871+00:00
4 2019-04-24 18:42:40.272854+00:00 2019-04-24 18:44:57.507857+00:00
5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00
5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00
5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00
5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00
5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00
progress_id Range
0 a387ab916f402cb3fbfffd29f68fd0ce 2018-06
1 3b9dce04f32da32763124602557f92a3 2019-03
2 73e17a05355852fe65b785c82c37d1ad 2019-07
3 cc3eb34ae49c719648352c4175daee88 2018-07
4 04ace4fe130d90c801e24eea13ee808e 2019-04
5 04ace4fe130d90c801e24eea13ee808e 2019-05-31 00:00:00
5 04ace4fe130d90c801e24eea13ee808e 2019-06-30 00:00:00
5 04ace4fe130d90c801e24eea13ee808e 2019-07-31 00:00:00
5 04ace4fe130d90c801e24eea13ee808e 2019-08-31 00:00:00
5 04ace4fe130d90c801e24eea13ee808e 2019-09-30 00:00:00
我有 pandas 数据框:
start_date finish_date progress_id
0 2018-06-23 08:28:50.681065+00 2018-06-23 08:28:52.439542+00 a387ab916f402cb3fbfffd29f68fd0ce
1 2019-03-18 14:23:17.328374+00 2019-03-18 14:54:50.979612+00 3b9dce04f32da32763124602557f92a3
2 2019-07-09 09:18:46.19862+00 2019-07-11 08:03:09.222385+00 73e17a05355852fe65b785c82c37d1ad
3 2018-07-27 15:39:17.666629+00 2018-07-27 16:13:55.086871+00 cc3eb34ae49c719648352c4175daee88
4 2019-04-24 18:42:40.272854+00 2019-04-24 18:44:57.507857+00 04ace4fe130d90c801e24eea13ee808e
我将列转换为 datetime.date 因为我不需要 df:
中的时间df['start_date'] = pd.to_datetime(df['start_date']).dt.date
df['finish_date'] = pd.to_datetime(df['finish_date']).dt.date
所以,如果 start_date
和 finish_date
有相同的月份,我需要一个包含年月的新列。如果它们之间的放置范围不同。例如 start_date = 06-2020,finish_date = 08-2020 结果为 [06-2020, 07-2020, 08-2020]。然后我需要按列 explode
它。
我试过了:
df['range'] = df.apply(lambda x: pd.date_range(x['start_date'], x['finish_date'], freq="M"), axis=1)
df = df.explode('range')
但结果我在专栏中有很多 NaT。
任何解决方案都会很棒。
一种选择如下。假设您有以下数据框,df:
start_date finish_date \
0 2018-06-23 08:28:50.681065+00 2018-06-23 08:28:52.439542+00
1 2019-03-18 14:23:17.328374+00 2019-03-18 14:54:50.979612+00
2 2019-07-09 09:18:46.19862+00 2019-07-11 08:03:09.222385+00
3 2018-07-27 15:39:17.666629+00 2018-07-27 16:13:55.086871+00
4 2019-04-24 18:42:40.272854+00 2019-04-24 18:44:57.507857+00
5 2019-05-24 18:42:40.272854+00 2019-04-24 18:44:57.507857+00
progress_id
0 a387ab916f402cb3fbfffd29f68fd0ce
1 3b9dce04f32da32763124602557f92a3
2 73e17a05355852fe65b785c82c37d1ad
3 cc3eb34ae49c719648352c4175daee88
4 04ace4fe130d90c801e24eea13ee808e
5 04ace4fe130d90c801e24eea13ee808e
除了日期(年份和月份)不同的一行外,与您共享的相同。
然后应用这个:
df['start_date'] = pd.to_datetime(df['start_date'],format='%Y-%m-%d')
df['finish_date'] = pd.to_datetime(df['finish_date'],format='%Y-%m-%d')
df['finish_M_Y'] = df['finish_date'].dt.strftime('%Y-%m')
df['Start_M_Y'] = df['start_date'].dt.strftime('%Y-%m')
def range(row):
if row['Start_M_Y'] == row['finish_M_Y']:
val = row['Start_M_Y']
elif row['Start_M_Y'] != row['finish_M_Y']:
val = pd.date_range(row['Start_M_Y'] , row['finish_M_Y'], freq='M')
else:
val = -1
return val
df['Range'] = df.apply(range, axis=1)
df.explode('Range').drop(['Start_M_Y', 'finish_M_Y'], axis=1)
给你
start_date finish_date \
0 2018-06-23 08:28:50.681065+00:00 2018-06-23 08:28:52.439542+00:00
1 2019-03-18 14:23:17.328374+00:00 2019-03-18 14:54:50.979612+00:00
2 2019-07-09 09:18:46.198620+00:00 2019-07-11 08:03:09.222385+00:00
3 2018-07-27 15:39:17.666629+00:00 2018-07-27 16:13:55.086871+00:00
4 2019-04-24 18:42:40.272854+00:00 2019-04-24 18:44:57.507857+00:00
5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00
5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00
5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00
5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00
5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00
progress_id Range
0 a387ab916f402cb3fbfffd29f68fd0ce 2018-06
1 3b9dce04f32da32763124602557f92a3 2019-03
2 73e17a05355852fe65b785c82c37d1ad 2019-07
3 cc3eb34ae49c719648352c4175daee88 2018-07
4 04ace4fe130d90c801e24eea13ee808e 2019-04
5 04ace4fe130d90c801e24eea13ee808e 2019-05-31 00:00:00
5 04ace4fe130d90c801e24eea13ee808e 2019-06-30 00:00:00
5 04ace4fe130d90c801e24eea13ee808e 2019-07-31 00:00:00
5 04ace4fe130d90c801e24eea13ee808e 2019-08-31 00:00:00
5 04ace4fe130d90c801e24eea13ee808e 2019-09-30 00:00:00