Time series: automate labeling based on value per day
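I want to take point-referenced data and convert it into a form suitable for time series analysis.
My data has a start_date and an end_date, which I use to calculate the duration of an event (if it starts and ends on the same day, the duration is 1; duration values range from 0 to 326).
In addition, I have another pandas DataFrame, a series of flattened images of size x*y, which contains (among others) three columns: time, lon (x), lat (y).
Now, if a date falls within (start_date + duration), I want to attach a boolean True to the latter dataset. Otherwise, False.
Below I created some sample data, together with the desired output, to visualize what I want to achieve. Does anyone know how to tackle this problem?
In other words, I want to add a True label to the rows that fall within that duration. See the example of my data with the desired output.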
import numpy as np
import pandas as pd

df = pd.DataFrame({'lat':[50,60],
'long':[-120,-110],
'time':[np.datetime64('2020-01-01'),np.datetime64('2020-05-06')],
'end_time':[np.datetime64('2020-01-04'),np.datetime64('2020-05-06')],
'duration':[4,1]})
lat long time end_time duration
0 50 -120 2020-01-01 2020-01-04 4
x 60 -110 2020-05-06 2020-05-06 1
The desired output looks like this:
lat lon time label
0 50 -120 2020-01-01 True
1 50 -120 2020-01-02 True
2 50 -120 2020-01-03 True
3 50 -120 2020-01-04 True
4 50 -120 2020-01-05 False
...
x-1 60 -110 2020-05-05 False
x 60 -110 2020-05-06 True
x+1 60 -110 2020-05-07 False
EDIT: below is an example of how my data is really structured:
df = pd.DataFrame({'time': [np.datetime64('2002-05-18'),np.datetime64('2002-05-18'),np.datetime64('2002-05-18'),np.datetime64('2002-05-18'),np.datetime64('2002-05-18')],
'lon': [-116.125, -115.875, -115.625, -115.375, -115.125],
'lat': [55.125, 55.125, 55.125, 55.125, 55.125],
'out_date': [np.datetime64('2002-05-19'), None, None, None, None],
'duration_days': [2, None, None, None, None],
'Fire': [True, None, None, None, None],
'size_ha': [0.1, None, None, None, None],
'fire_count': [1, None, None, None, None]})
time lon lat out_date duration_days Fire size_ha fire_count
0 2002-05-18 -116.125 55.125 2002-05-19 2.0 True 0.1 1.0
1 2002-05-18 -115.875 55.125 NaT NaN None NaN NaN
2 2002-05-18 -115.625 55.125 NaT NaN None NaN NaN
3 2002-05-18 -115.375 55.125 NaT NaN None NaN NaN
4 2002-05-18 -115.125 55.125 NaT NaN None NaN NaN
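For each day there are 42x46 lon, lat combinations to go through before moving on to the next day. In the newly added table you can see that on 'time': 2002-05-18 a forest fire occurred at coordinates x,y with a duration of 2. What I would like is that, if I now move 1932 (42x46) rows down to 'time': 2002-05-19, I see the column 'Fire' (which is the label) updated to True. Assuming we group by lon and lat, the data would look like the data in the 'desired output' example DataFrame.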
First convert both columns to datetimes and add one day to end_time, then repeat the index with Index.repeat using the difference in days, and add counter values from GroupBy.cumcount converted with to_timedelta:
df['time'] = pd.to_datetime(df['time'])
df['end_time'] = pd.to_datetime(df['end_time']) + pd.Timedelta(1, unit='d')
s = df.pop('end_time').sub(df['time']).dt.days
df = df.loc[df.index.repeat(s)].copy()
counter = df.groupby(level=0).cumcount()
df['time'] = df['time'].add(pd.to_timedelta(counter, unit='d'))
df = df.reset_index(drop=True)
print (df)
lat long time duration
0 50 -120 2020-01-01 4
1 50 -120 2020-01-02 4
2 50 -120 2020-01-03 4
3 50 -120 2020-01-04 4
4 60 -110 2020-05-06 1
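For reference, the expansion step can be wrapped in a small helper that runs on its own. This is only a sketch: the function name expand_events and its parameters are made up for illustration, and it assumes every row has end_time >= time and no missing dates.

```python
import pandas as pd

def expand_events(events, start='time', end='end_time'):
    """Return one row per calendar day from start to end (inclusive).

    Hypothetical helper; assumes end >= start and no missing dates.
    """
    out = events.copy()
    out[start] = pd.to_datetime(out[start])
    out[end] = pd.to_datetime(out[end])
    # Inclusive day count: an event that starts and ends on the same day lasts 1 day.
    n_days = (out[end] - out[start]).dt.days + 1
    # Repeat each row n_days times; the original index labels are kept for now.
    out = out.loc[out.index.repeat(n_days)].drop(columns=end)
    # 0, 1, 2, ... within each original row, turned into a day offset.
    offset = pd.to_timedelta(out.groupby(level=0).cumcount(), unit='D')
    out[start] = out[start] + offset
    return out.reset_index(drop=True)

events = pd.DataFrame({'lat': [50, 60],
                       'long': [-120, -110],
                       'time': ['2020-01-01', '2020-05-06'],
                       'end_time': ['2020-01-04', '2020-05-06']})
expanded = expand_events(events)
print(expanded)
```

Adding 1 to the day difference here plays the same role as adding one day to end_time above: it makes the end date inclusive.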
Then merge it into the other DataFrame with a left join and the indicator parameter, and finally compare the indicator column with 'both':
df1 = pd.DataFrame({'time': [np.datetime64('2020-01-03'),np.datetime64('2002-05-18'),
np.datetime64('2002-05-18'),np.datetime64('2002-05-18'),
np.datetime64('2002-05-18')],
'lon': [-120, -115.875, -115.625, -115.375, -115.125],
'lat': [50, 55.125, 55.125, 55.125, 55.125],
'out_date': [np.datetime64('2002-05-19'), None, None, None, None],
'duration_days': [2, None, None, None, None],
'Fire': [True, None, None, None, None],
'size_ha': [0.1, None, None, None, None],
'fire_count': [1, None, None, None, None]})
df = (df1.rename(columns={'lon':'long'})
.merge(df,
on=['lat','long','time'],
indicator='label',
how='left'))
df['label'] = df['label'].eq('both')
print (df)
time long lat out_date duration_days Fire size_ha \
0 2020-01-03 -120.000 50.000 2002-05-19 2.0 True 0.1
1 2002-05-18 -115.875 55.125 NaT NaN None NaN
2 2002-05-18 -115.625 55.125 NaT NaN None NaN
3 2002-05-18 -115.375 55.125 NaT NaN None NaN
4 2002-05-18 -115.125 55.125 NaT NaN None NaN
fire_count duration label
0 1.0 4.0 True
1 NaN NaN False
2 NaN NaN False
3 NaN NaN False
4 NaN NaN False
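To see the labeling on something closer to the gridded layout described in the question, here is a minimal sketch on a made-up miniature grid (two cells, six days). The names grid, event_days and labelled are illustrative, the cross join needs pandas 1.2 or later, and the event rows are assumed to be already expanded to one row per day.

```python
import pandas as pd

# Hypothetical miniature grid: every cell appears once for every day.
days = pd.DataFrame({'time': pd.date_range('2020-01-01', '2020-01-06', freq='D')})
cells = pd.DataFrame({'lat': [50, 60], 'long': [-120, -110]})
grid = cells.merge(days, how='cross')

# One row per event day, as produced by the expansion step.
event_days = pd.DataFrame({'lat': [50] * 4,
                           'long': [-120] * 4,
                           'time': pd.date_range('2020-01-01', '2020-01-04', freq='D')})

# Keep only the key columns and drop duplicates, so that overlapping events
# in the same cell cannot duplicate grid rows in the left join.
keys = ['lat', 'long', 'time']
labelled = grid.merge(event_days[keys].drop_duplicates(),
                      on=keys, how='left', indicator='label')
labelled['label'] = labelled['label'].eq('both')
print(labelled)
```

Note that the merge matches coordinates by exact equality, so with float lon/lat values both DataFrames need to hold identical values for the same cell.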
EDIT: If you get the error:
ValueError: repeats may not contain negative values
it means that in some rows end_time is earlier than the time column:
df = pd.DataFrame({'lat':[50,60],
'long':[-120,-110],
'time':[np.datetime64('2020-01-01'),np.datetime64('2020-05-06')],
'end_time':[np.datetime64('2020-01-04'),np.datetime64('2020-05-04')],
'duration':[4,1]})
df['time'] = pd.to_datetime(df['time'])
df['end_time'] = pd.to_datetime(df['end_time'])
print (df)
lat long time end_time duration
0 50 -120 2020-01-01 2020-01-04 4
1 60 -110 2020-05-06 2020-05-04 1 end_time < time
A possible solution is to swap these values:
m = df['time'].gt(df['end_time'])
d = {'end_time':'time','time':'end_time'}
df.loc[m, ['time','end_time']] = df.loc[m, ['end_time','time']].rename(columns=d)
df['end_time'] = df['end_time'] + pd.Timedelta(1, unit='d')
print (df)
lat long time end_time duration
0 50 -120 2020-01-01 2020-01-05 4
1 60 -110 2020-05-04 2020-05-07 1
Then use the solution above:
s = df.pop('end_time').sub(df['time']).dt.days
df = df.loc[df.index.repeat(s)].copy()
counter = df.groupby(level=0).cumcount()
df['time'] = df['time'].add(pd.to_timedelta(counter, unit='d'))
df = df.reset_index(drop=True)
print (df)
lat long time duration
0 50 -120 2020-01-01 4
1 50 -120 2020-01-02 4
2 50 -120 2020-01-03 4
3 50 -120 2020-01-04 4
4 60 -110 2020-05-04 1
5 60 -110 2020-05-05 1
6 60 -110 2020-05-06 1
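Whether swapping is appropriate depends on how the dates were recorded. If reversed or missing dates should be reviewed rather than silently repaired, one alternative is to separate those rows before repeating. The following is a sketch under that assumption; the names events, valid and clean, and the third row with a missing end_time, are made up for illustration.

```python
import pandas as pd

events = pd.DataFrame({'lat': [50, 60, 70],
                       'long': [-120, -110, -100],
                       'time': pd.to_datetime(['2020-01-01', '2020-05-06', '2020-07-01']),
                       'end_time': pd.to_datetime(['2020-01-04', '2020-05-04', None])})

# Inclusive number of days per event; NaT in either column gives NaN here.
n_days = (events['end_time'] - events['time']).dt.days + 1

# NaN compares as False, so rows with missing dates are excluded as well.
valid = n_days.ge(1)

print(events[~valid])   # rows to inspect: end_time before time, or missing
clean = events[valid]   # rows that are safe to pass to Index.repeat
```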