Time Series Dataframe Groupby 3d Array - observation/row count - For LSTM
I have a time series structured as shown below, with an identifier column and two value columns (floats).
The dataframe is simply called df:
Date Id Value1 Value2
2014-10-01 A 1.1 1.2
2014-10-01 B 1.3 1.4
2014-10-02 A 1.5 1.6
2014-10-02 B 1.7 1.8
2014-10-03 A 3.2 4.8
2014-10-03 B 8.2 10.1
2014-10-04 A 6.1 7.2
2014-10-04 B 4.3 4.1
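What I'd like to do is turn it into an array grouped by the identifier column, using a rolling window of 3 observations, so I would end up with this: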
[[[1.1 1.2]
[1.5 1.6] '----> ID A 10/1 to 10/3'
[3.2 4.8]]
[[1.3 1.4]
[1.7 1.8] '----> ID B 10/1 to 10/3'
[8.2 10.1]]
[[1.5 1.6]
[3.2 4.8] '----> ID A 10/2 to 10/4'
[6.1 7.2]]
[[1.7 1.8]
[8.2 10.1] '----> ID B 10/2 to 10/4'
[4.3 4.1]]]
Ignore the quoted annotations inside the array above, of course, but hopefully you get the idea.
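To make the example reproducible, here is a minimal sketch that builds the sample frame and writes out the desired array by hand. The variable name `expected` is only illustrative, and parsing `Date` with `pd.to_datetime` is an assumption about how the real data is loaded.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table above.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2014-10-01", "2014-10-01", "2014-10-02", "2014-10-02",
        "2014-10-03", "2014-10-03", "2014-10-04", "2014-10-04",
    ]),
    "Id": ["A", "B", "A", "B", "A", "B", "A", "B"],
    "Value1": [1.1, 1.3, 1.5, 1.7, 3.2, 8.2, 6.1, 4.3],
    "Value2": [1.2, 1.4, 1.6, 1.8, 4.8, 10.1, 7.2, 4.1],
})

# Desired result: one block per (window, Id), each with 3 rows and 2 value columns.
expected = np.array([
    [[1.1, 1.2], [1.5, 1.6], [3.2, 4.8]],    # Id A, 10/1 to 10/3
    [[1.3, 1.4], [1.7, 1.8], [8.2, 10.1]],   # Id B, 10/1 to 10/3
    [[1.5, 1.6], [3.2, 4.8], [6.1, 7.2]],    # Id A, 10/2 to 10/4
    [[1.7, 1.8], [8.2, 10.1], [4.3, 4.1]],   # Id B, 10/2 to 10/4
])

print(expected.shape)  # (4, 3, 2): (samples, time steps, features)
```

The three axes correspond to the (samples, time steps, features) layout that LSTM layers typically expect.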
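I have a larger dataset with many more identifiers, and I may need to change the observation count, so I can't hard-code the row count. So far I'm leaning toward taking the unique values of the Id column, creating a temporary df for each one, and iterating over it to grab 3 values at a time.
It seems like there should be a better, faster way to do this.
Pseudocode:
unique_ids = df['Id'].unique().tolist()
for id in unique_ids:
    temp_df = df.loc[df['Id'] == id]
The part I'm stuck on, though, is the best way to iterate over temp_df.
The final output will be used in an LSTM model; however, most other solutions are written so that they don't need to handle a groupby aspect such as the 'Id' column.
Here is what I ended up doing for a solution. It isn't the prettiest, but then again my problem wasn't winning any beauty contests to begin with.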
import numpy as np

id_list = array_steps_df['Id'].unique().tolist()
# change number of steps as needed
step = 3
column_list = ['Value1', 'Value2']
master_list = []
for id_value in id_list:
    master_dict = {}
    # rows for this Id only, so windows never mix identifiers
    id_df = array_steps_df.loc[array_steps_df['Id'] == id_value]
    for column in column_list:
        column_values = id_df[[column]].values
        master_dict[column] = []
        # one window of `step` consecutive observations per start position
        for obs in range(len(column_values) - step + 1):
            master_dict[column].append(column_values[obs:obs + step])
    master_list.append(master_dict)

# concatenate the per-Id window lists for each column
value1_windows = []
value2_windows = []
for dic in master_list:
    value1_windows += dic['Value1']
    value2_windows += dic['Value2']
value1_array = np.array(value1_windows)
value2_array = np.array(value2_windows)
# -1 lets NumPy infer the number of windows instead of computing it by hand
all_array = np.hstack((value1_array, value2_array)).reshape((-1,
                                                             len(column_list),
                                                             step)).transpose(0, 2, 1)
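Fixed; the mistake was mine. I added a transpose at the end and swapped the order of the features and steps in the reshape.
Thanks to this site for some extra help: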
https://www.mikulskibartosz.name/how-to-turn-pandas-data-frame-into-time-series-input-for-rnn/
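To illustrate why the reshape order and the final transpose matter, here is a small sketch with two hand-written windows. The names `wrong` and `right` are only for illustration; the arrays mimic the per-column arrays built above.

```python
import numpy as np

step, n_features = 3, 2

# Two windows of each feature, shaped (windows, step, 1) like the per-column arrays above.
value1 = np.array([[[1.1], [1.5], [3.2]], [[1.5], [3.2], [6.1]]])
value2 = np.array([[[1.2], [1.6], [4.8]], [[1.6], [4.8], [7.2]]])

# hstack joins along axis 1, so each window holds all Value1 steps, then all Value2 steps.
stacked = np.hstack((value1, value2))  # shape (2, 6, 1)

# Reshaping straight to (windows, step, features) pairs up the wrong numbers.
wrong = stacked.reshape(-1, step, n_features)

# Reshape to (windows, features, step) first, then swap the last two axes.
right = stacked.reshape(-1, n_features, step).transpose(0, 2, 1)

print(wrong[0])  # rows mix values from different time steps
print(right[0])  # each row is [Value1, Value2] for one time step
```

Because the stacked data is laid out feature by feature, the reshape has to name the feature axis before the step axis; the transpose then moves it to the end.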
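I ended up reworking this a bit to make the columns more dynamic and keep the time series in order, and I also added a target array so the predictions stay in the same order. For anyone who needs it, here is the function: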
import numpy as np


def data_to_array_steps(array_steps_df, time_steps, columns_to_array, id_column):
    """
    https://www.mikulskibartosz.name/how-to-turn-pandas-data-frame-into-time-series-input-for-rnn/
    :param array_steps_df: the dataframe from the csv (needs 'date' and 'target' columns)
    :param time_steps: how many time steps
    :param columns_to_array: what columns to convert to the array
    :param id_column: what is to be used for the identifier
    :return: data grouped in a # observations by identifier and date
    """
    id_list = array_steps_df[id_column].unique().tolist()
    # sorted so that each slice of date_list is a run of consecutive dates
    date_list = sorted(array_steps_df['date'].unique().tolist())
    master_list = []
    target_list = []
    missing_counter = 0
    total_counter = 0
    # grab date size = time steps at a time and iterate through all of them
    for date in range(len(date_list) - time_steps + 1):
        date_range_test = date_list[date:time_steps + date]
        date_range_df = array_steps_df.loc[(array_steps_df['date'] <= date_range_test[-1]) &
                                           (array_steps_df['date'] >= date_range_test[0])
                                           ]
        # for each id do it separately so time series data doesn't get mixed up
        for identifier in id_list:
            # get id in here and then skip if not the required time steps/observations for the id
            date_range_id = date_range_df.loc[date_range_df[id_column] == identifier]
            # if there aren't enough observations for the date range
            if len(date_range_id) != time_steps:
                # the counter is only really needed when debugging unusual cases; it does no harm for now
                missing_counter += 1
            else:
                master_dict = {}
                # add target each loop through for the last date in the date range for the id or ticker
                target = array_steps_df['target'].loc[
                    (array_steps_df['date'] == date_range_test[-1])
                    & (array_steps_df[id_column] == identifier)
                ].iloc[0]
                target_list.append(target)
                total_counter += 1
                # loop through each column in dataframe
                for column in columns_to_array:
                    date_range_id_value = date_range_id[[column]].values
                    master_dict[column] = [date_range_id_value]
                master_list.append(master_dict)
    # redo columns to arrays, after they have been ordered and grouped by Id above
    array_list = []
    # for each column, collect its windows in order, build one array, then append to list
    for column in columns_to_array:
        column_windows = []
        for dic in master_list:
            column_windows += dic[column]
        array_list.append(np.array(column_windows))
    # horizontally stack the per-column arrays, then reorder to (samples, time steps, features)
    all_array = np.hstack(array_list).reshape((total_counter,
                                               len(columns_to_array),
                                               time_steps
                                               )
                                              ).transpose(0, 2, 1)
    target_array_all = np.array(target_list
                                ).reshape(len(target_list),
                                          1)
    # should probably make this an if condition later after a few more tests
    print('check of length of arrays', len(all_array), len(target_array_all))
    return all_array, target_array_all
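As a possible shortcut for the windowing step alone, here is a sketch that combines groupby with NumPy's sliding_window_view (available from NumPy 1.20). It is not a drop-in replacement for the function above: the name `rolling_windows_by_id` and its parameters are hypothetical, windows are taken over consecutive rows per Id rather than over calendar date ranges, no target array is built, and the output is ordered by Id first and then by window.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def rolling_windows_by_id(frame, time_steps, value_columns, id_column, date_column):
    """Return an array of shape (samples, time_steps, features), grouped by id."""
    frame = frame.sort_values([id_column, date_column])
    blocks = []
    for _, group in frame.groupby(id_column, sort=True):
        values = group[value_columns].to_numpy()
        if len(values) < time_steps:
            continue  # not enough observations for even one window
        # sliding_window_view puts the window axis last: (windows, features, time_steps)
        windows = sliding_window_view(values, time_steps, axis=0)
        blocks.append(windows.transpose(0, 2, 1))
    if not blocks:
        return np.empty((0, time_steps, len(value_columns)))
    return np.concatenate(blocks, axis=0)


# Usage with the sample data from the question (dates kept as ISO strings here).
df = pd.DataFrame({
    "Date": ["2014-10-01", "2014-10-01", "2014-10-02", "2014-10-02",
             "2014-10-03", "2014-10-03", "2014-10-04", "2014-10-04"],
    "Id": ["A", "B", "A", "B", "A", "B", "A", "B"],
    "Value1": [1.1, 1.3, 1.5, 1.7, 3.2, 8.2, 6.1, 4.3],
    "Value2": [1.2, 1.4, 1.6, 1.8, 4.8, 10.1, 7.2, 4.1],
})
X = rolling_windows_by_id(df, 3, ["Value1", "Value2"], "Id", "Date")
print(X.shape)  # (4, 3, 2)
```

If some identifiers have gaps in their dates, the date-range check in the function above is still needed, since this sketch would silently build windows across the gap.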