pandas: fill missing time intervals as given in a DataFrame
My DataFrame looks like this:
gap_id | species | time_start | time_stop |
---|---|---|---|
1 | wheat | 2021-11-22 00:01:00 | 2021-11-22 00:03:00 |
2 | fescue | 2021-12-18 05:52:00 | 2021-12-18 05:53:00 |
I would like to expand the DataFrame so that, for each gap_id, I get as many rows as there are minutes between time_start and time_stop:
gap_id | species | time |
---|---|---|
1 | wheat | 2021-11-22 00:01:00 |
1 | wheat | 2021-11-22 00:02:00 |
1 | wheat | 2021-11-22 00:03:00 |
2 | fescue | 2021-12-18 05:52:00 |
2 | fescue | 2021-12-18 05:53:00 |
I have tried the pd.date_range method, but I can't work out how to combine it with a groupby on gap_id.
Thanks in advance.
If the DataFrame is small and performance is not important, generate a date_range for each row and then use DataFrame.explode:
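import pandas as pd

# One DatetimeIndex per row, both endpoints included, in one-minute steps
# ('min' is the minute alias; the older 'T' is deprecated in recent pandas)
df['time'] = df.apply(lambda x: pd.date_range(x['time_start'], x['time_stop'], freq='min'), axis=1)
# Drop the interval columns and turn each element of 'time' into its own row
df = df.drop(['time_start','time_stop'], axis=1).explode('time')
print(df)
   gap_id species                time
0       1   wheat 2021-11-22 00:01:00
0       1   wheat 2021-11-22 00:02:00
0       1   wheat 2021-11-22 00:03:00
1       2  fescue 2021-12-18 05:52:00
1       2  fescue 2021-12-18 05:53:00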
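For anyone who wants to reproduce this end to end, here is a minimal sketch that builds the sample data from the question and applies the explode approach. The name `expanded` and the final `pd.to_datetime` call are my additions, not part of the answer: `explode` may leave the `time` column with object dtype, so converting it back is an assumption about what you want downstream.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "gap_id": [1, 2],
    "species": ["wheat", "fescue"],
    "time_start": pd.to_datetime(["2021-11-22 00:01:00", "2021-12-18 05:52:00"]),
    "time_stop": pd.to_datetime(["2021-11-22 00:03:00", "2021-12-18 05:53:00"]),
})

# One DatetimeIndex per row, both endpoints included, one-minute steps.
df["time"] = df.apply(
    lambda row: pd.date_range(row["time_start"], row["time_stop"], freq="min"),
    axis=1,
)

# `expanded` is a hypothetical name used only for this illustration.
expanded = (
    df.drop(columns=["time_start", "time_stop"])
      .explode("time")
      .reset_index(drop=True)
)

# Assumption: we want a proper datetime column again after explode.
expanded["time"] = pd.to_datetime(expanded["time"])
print(expanded)
```

Resetting the index is optional; without it, every expanded row keeps the index label of the row it came from, which can be handy for joining back to the original data.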
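For a large DataFrame, first repeat the index by the difference between the stop and start columns in minutes (plus one, so that both endpoints are kept), then add a per-row counter from GroupBy.cumcount, converted to timedeltas with to_timedelta: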
# Make sure both columns are datetimes, not strings
df['time_start'] = pd.to_datetime(df['time_start'])
df['time_stop'] = pd.to_datetime(df['time_stop'])
# Rows per gap: whole minutes between start and stop, plus one for the endpoint
# (cast to int because repeat needs integer counts)
n = (df['time_stop'].sub(df['time_start']).dt.total_seconds() // 60 + 1).astype(int)
df = (df.loc[df.index.repeat(n)]
        .drop('time_stop', axis=1)
        .rename(columns={'time_start':'time'}))
# Counter 0, 1, 2, ... within each original row, interpreted as minutes
td = pd.to_timedelta(df.groupby(level=0).cumcount(), unit='min')
df['time'] += td
df = df.reset_index(drop=True)
print(df)
   gap_id species                time
0       1   wheat 2021-11-22 00:01:00
1       1   wheat 2021-11-22 00:02:00
2       1   wheat 2021-11-22 00:03:00
3       2  fescue 2021-12-18 05:52:00
4       2  fescue 2021-12-18 05:53:00
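The row count in the vectorized approach is plain arithmetic: whole minutes between start and stop, plus one so that both endpoints are kept. The sketch below works that out on the question's sample data. It uses floor division by `pd.Timedelta(minutes=1)` to get integer counts directly, which is a swapped-in variant on my part (the `total_seconds() // 60` form is the one used above), and the names `n_rows` and `out` are hypothetical.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "gap_id": [1, 2],
    "species": ["wheat", "fescue"],
    "time_start": pd.to_datetime(["2021-11-22 00:01:00", "2021-12-18 05:52:00"]),
    "time_stop": pd.to_datetime(["2021-11-22 00:03:00", "2021-12-18 05:53:00"]),
})

# Whole minutes in each gap, plus one to include the stop time itself.
# Timedelta // Timedelta yields integers, so no cast is needed.
n_rows = (df["time_stop"] - df["time_start"]) // pd.Timedelta(minutes=1) + 1
print(n_rows.tolist())  # [3, 2]

# Repeat each original row n_rows times; the index labels repeat as well.
out = (
    df.loc[df.index.repeat(n_rows)]
      .drop(columns="time_stop")
      .rename(columns={"time_start": "time"})
)

# cumcount numbers the copies of each original row 0, 1, 2, ...;
# adding that many minutes to the start time gives the final timestamps.
out["time"] += pd.to_timedelta(out.groupby(level=0).cumcount(), unit="min")
out = out.reset_index(drop=True)
print(out)
```

If a gap is not a whole number of minutes long, the floor division drops the partial minute, so the last generated timestamp falls before time_stop. As far as I can tell that matches what date_range does with an end that is not on the frequency grid, but it is worth checking on your own data.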