How to clean up big data and reshape it in pandas?
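I have a dataset containing one year of data, sampled every minute of every day. However, on some days or hours the samples do not cover all minutes 00 to 59 (the sensor was off), so the corresponding time steps are missing altogether. The series also contains some NaN values. The data looks like this: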
time x
2019-01-01 00:00:00 10.0 # Day 1
2019-01-01 00:01:00 9.0 # Day 1
... ...
2019-01-01 00:59:00 14.0 # Day 1
... ...
2019-01-02 00:00:00 10.0 # Day 2
2019-01-02 00:01:00 9.0 # Day 2
2019-01-02 00:02:00 NaN # Day 2
... ...
2019-01-02 00:50:00 14.0 # Day 2
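As you can see, for Day 1 the dataset contains a valid value for every minute of the first hour. For Day 2, the first hour only has readings up to minute 50. There are also some NaN values in it.
So my objective is to clean this data in a reasonable way and reshape it for further processing: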
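1. If the sensor was off during some hours and there are no readings (the time index stops before minute 59, as in Day 2 above), extend the index to minute 59 and set the corresponding values to NaN.
2. If more than 80% of the values in an hour are NaN, drop that hour from the dataset. Otherwise, replace each NaN with the previous value.
3. Reshape the data frame with date-hour on the vertical axis and minutes on the horizontal axis. (I need the final data frame to look like this.)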
Date-Hour min_00 ... min_59
2019-01-01 00:01 10 14
...
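To make the threshold in step 2 concrete: 80% of 60 minutes is 48, so an hour is dropped once more than 48 of its minutes are NaN, which means fewer than 12 valid readings. The sketch below is only an illustration on toy data. The helper name `hourly_nan_fraction` is hypothetical, and it assumes that minutes with no row at all count as NaN (step 1).

```python
import numpy as np
import pandas as pd

def hourly_nan_fraction(s: pd.Series) -> pd.Series:
    """Fraction of missing minutes per hour; absent rows count as NaN."""
    start = s.index.min().floor("60min")
    end = s.index.max().floor("60min") + pd.Timedelta(minutes=59)
    # Reindex onto a complete minute grid so missing rows become NaN.
    full = s.reindex(pd.date_range(start, end, freq="min"))
    return full.isna().astype(float).groupby(full.index.floor("60min")).mean()

# Toy data (made up): hour 00 is complete, hour 01 has only 6 readings.
idx = pd.date_range("2019-01-01 00:00", periods=60, freq="min").append(
    pd.date_range("2019-01-01 01:00", periods=6, freq="min"))
s = pd.Series(np.arange(66, dtype=float), index=idx)

frac = hourly_nan_fraction(s)
drop = frac > 0.8  # True for hours that step 2 would remove
```

Here hour 01 has 54 of 60 minutes missing, so it falls above the 80% threshold, while hour 00 does not.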
Here is what I have tried so far, but it does not fully cover the steps above:
import pandas as pd

df.columns = ['time', 'x']  # Setting readable names for columns
df = df.set_index('time')  # Setting time column as index
df.index = pd.to_datetime(df.index)  # converting the index to timestamps
# step 1 is needed (but how?) to extend the indices so every hour has all 60 min,
# even if for some periods there are no data and no time index in the data set.
# first part of step 2
# if more than 80% of values for an hour are NaN, drop that particular hour
# from the dataset, i.e. if more than 48 min (80% * 60 = 48 min) have
# NaN values, drop that hour. (How?)
# filling all NaN values with their previous value (second part of step 2)
df = df.ffill()
# Step 3 (incomplete)
# Reshaping the df to Date vs time (hours, minutes)
df = df.set_index([df.index.date, df.index.time])
df = df.unstack(level=[1])
# However I want it to be like Date-Hour vs minute (but how?)
# perhaps it would be easier to apply step 3 before step 2, because removing
# the nonsense hours (with more than 48 min NaN) would be easier as each hour
# appears in a column
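It took some time to find the right way to fill in the missing minutes, since resample is only a workaround when the start times are irregular, but here it is: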
import numpy as np
import pandas as pd

# necessary statements for building the datetime index go here
a = df.groupby(pd.Grouper(freq='60min'))
col_list = ['Date-Hour']
for i in range(60):
    col_list.append('min_' + str(i).zfill(2))  # min_00 ... min_59
new_df = pd.DataFrame(columns=col_list)
for idx, i in enumerate(a):
    if len(i[1]) > 48:  # the 80% from step 2, hard-coded here as a row count
        st = str(i[0])
        # full minute grid for this hour, from :00 to :59
        nidx = pd.date_range(start=st, end=st[:-5] + '59:00', freq='min')
        ns = pd.Series(np.nan, index=nidx, name='x')
        comb_series = pd.concat([i[1]['x'], ns])
        comb_series = comb_series[~comb_series.index.duplicated(keep='first')]
        comb_series = comb_series.sort_index()  # required due to the concat above
        tmp = comb_series.tolist()
        tmp.insert(0, st[:-3])
        new_df.loc[idx] = tmp
I expect there will be a lot of data, so this will be slow.
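If the row-by-row loop turns out to be too slow, one possible alternative is to do all three steps with vectorized operations. The following is a minimal sketch under stated assumptions, not a definitive implementation: the function name `clean_and_reshape` and the parameter `max_nan_frac` are hypothetical, `df` is assumed to have a DatetimeIndex and a single column `x`, and forward filling is done within each hour so that values are not carried over from a dropped hour (a plain `s.ffill()` would be closer to the attempt in the question).

```python
import numpy as np
import pandas as pd

def clean_and_reshape(df: pd.DataFrame, max_nan_frac: float = 0.8) -> pd.DataFrame:
    """Hypothetical helper: df needs a DatetimeIndex and a column 'x'."""
    s = df["x"].sort_index()
    s = s[~s.index.duplicated(keep="first")]

    # Step 1: reindex onto a complete minute grid; missing rows become NaN.
    start = s.index.min().floor("60min")
    end = s.index.max().floor("60min") + pd.Timedelta(minutes=59)
    s = s.reindex(pd.date_range(start, end, freq="min"))

    # Step 2a: drop hours where more than max_nan_frac of the minutes are NaN.
    hour = s.index.floor("60min")
    nan_frac = s.isna().astype(float).groupby(hour).transform("mean")
    s = s[nan_frac <= max_nan_frac]

    # Step 2b: forward fill inside each remaining hour (assumption, see above).
    # A NaN at the very start of an hour has no previous value and stays NaN.
    s = s.groupby(s.index.floor("60min")).ffill()

    # Step 3: one row per date-hour, one column per minute.
    long = pd.DataFrame({
        "hour": s.index.floor("60min"),
        "minute": s.index.minute,
        "x": s.to_numpy(),
    })
    wide = long.pivot(index="hour", columns="minute", values="x")
    wide.columns = [f"min_{m:02d}" for m in wide.columns]
    wide.index.name = "Date-Hour"
    return wide

# Toy data (made up): hour 00 complete with one NaN, hour 01 with only
# 6 readings, hour 02 with readings for minutes 00 to 50.
idx = (pd.date_range("2019-01-01 00:00", periods=60, freq="min")
       .append(pd.date_range("2019-01-01 01:00", periods=6, freq="min"))
       .append(pd.date_range("2019-01-01 02:00", periods=51, freq="min")))
df = pd.DataFrame({"x": np.arange(len(idx), dtype=float)}, index=idx)
df.iloc[2, 0] = np.nan

wide = clean_and_reshape(df)
```

In this sketch the `Date-Hour` index holds timestamps rather than strings. If a string label is preferred, something like `wide.index.strftime("%Y-%m-%d %H")` could be applied afterwards.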