How to clean up big data and reshape it in pandas?

I have a dataset containing one year of data, sampled at every minute of the day. However, on some days or in some hours there are fewer than the full 60 minutes of samples (the sensor was off), so the corresponding time steps are simply missing from the index. In addition, the series contains some NaN values. The data looks like this:

        time             x
2019-01-01 00:00:00    10.0    # Day 1
2019-01-01 00:01:00    9.0     # Day 1
... ...
2019-01-01 00:59:00   14.0    # Day 1

... ...
2019-01-02 00:00:00    10.0    # Day 2
2019-01-02 00:01:00    9.0     # Day 2
2019-01-02 00:02:00    NaN     # Day 2
... ...
2019-01-02 00:50:00    14.0    # Day 2

As you can see, for Day 1 the dataset contains a valid value for every minute of the first hour of the day. For Day 2 the first hour covers only 50 minutes, and there are also some NaN values.

So my objective is to clean this data in a sensible way and reshape it for further processing:

  1. If the sensor was off for some hours and there are no readings (the time index covers fewer than 59 minutes, as on Day 2 above), extend the index to the full 59 minutes and set the corresponding values to NaN.

  2. If more than 80% of the values within an hour are NaN, drop that particular hour from the dataset. Otherwise, replace each NaN value with the previous value.

  3. Reshape the data frame with date-hour along the vertical axis and minute along the horizontal axis. (I need the final data frame to look like this:)

    Date-Hour min_00 ... min_59

    2019-01-01 00:01 10 14

    ...
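Step 1 above can be illustrated in isolation: reindexing onto a complete minute range inserts the missing timestamps and fills them with NaN. This is a minimal sketch with a hypothetical three-reading hour (the sample values are made up):

```python
import pandas as pd

# hypothetical hour with only three recorded minutes
s = pd.Series(
    [10.0, 9.0, 14.0],
    index=pd.to_datetime(
        ["2019-01-02 00:00", "2019-01-02 00:01", "2019-01-02 00:50"]
    ),
)

# a complete 60-minute index for that hour
idx = pd.date_range("2019-01-02 00:00", "2019-01-02 00:59", freq="1min")

# reindex keeps the existing readings and inserts the missing minutes as NaN
full = s.reindex(idx)
```

Note that `asfreq('1min')` alone would only fill gaps between the first and last existing timestamps; reindexing against an explicitly built range also extends the hour out to minute 59.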

Here is what I have tried so far, but it does not fully accomplish the steps above:

df.columns = ['time', 'x']           # readable column names
df = df.set_index('time')            # use the time column as the index
df.index = pd.to_datetime(df.index)  # convert the index to timestamps


# Step 1 is needed here (but how?): extend the index so every hour is a full
# 60 minutes long, even when some periods have no data or time index in the data set.


# first part of step 2
# if more than 80% of the values for an hour are NaN, drop that particular hour
# from the dataset, i.e. if more than 48 minutes (80% * 60 min = 48 min) have
# NaN values, drop that hour. (How?)

# filling each NaN value with the previous valid value (second part of step 2)
df['x'] = df['x'].ffill()

# Step 3 (incomplete)  
# Reshaping the df to Date vs time (hours, minutes)
df.set_index([df.index.date, df.index.time], inplace=True) 
df = df.unstack(level=[1])
# However I want it to be like Date-Hour vs minute  (but how?)

# Perhaps it would be easier to apply step 3 before step 2, because removing
# the nonsense hours (those with more than 48 min of NaN) would be easier once
# each hour appears as its own row/column
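For comparison, the three steps can also be sketched without an explicit loop. This is only a sketch, assuming the frame has a DatetimeIndex and a single column `x` (the column name is from the question; the sample data below is hypothetical):

```python
import numpy as np
import pandas as pd

# hypothetical minute data: day 1 hour 00 is complete, day 2 hour 00 is mostly missing
rng = pd.date_range("2019-01-01 00:00", periods=60, freq="1min")
df = pd.DataFrame({"x": np.arange(60.0)}, index=rng)
extra = pd.DataFrame(
    {"x": [10.0, np.nan]},
    index=pd.to_datetime(["2019-01-02 00:00", "2019-01-02 00:05"]),
)
df = pd.concat([df, extra])

# Step 1: reindex onto a complete minute grid covering every hour present
hours = df.index.floor("h").unique()
full_idx = pd.DatetimeIndex(
    np.concatenate([pd.date_range(h, periods=60, freq="1min") for h in hours])
)
df = df.reindex(full_idx)

# Step 2: drop hours in which more than 80% of the minutes are NaN,
# then forward-fill the remaining gaps
df = df.groupby(df.index.floor("h")).filter(lambda g: g["x"].isna().mean() <= 0.8)
df["x"] = df["x"].ffill()

# Step 3: pivot to Date-Hour rows vs. minute columns
out = df.pivot_table(index=df.index.floor("h"), columns=df.index.minute, values="x")
out.columns = [f"min_{m:02d}" for m in out.columns]
```

With this toy input, the day-2 hour (59 of 60 minutes NaN) is dropped by the filter and only the complete day-1 hour survives as one row of 60 minute columns.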

It took some time to find the right method for filling in the missing minutes, since resample needs a workaround for irregular start times, but here it is:

import numpy as np
import pandas as pd

# necessary stmts for the datetime index go here

a = df.groupby(pd.Grouper(freq='60min'))

col_list = ['Date-Hour']
for i in range(60):
    col_list.append('min_'+str(i))

new_df = pd.DataFrame(columns=col_list)

for idx, (hour_start, grp) in enumerate(a):
    if len(grp) > 48:  # the 80% rule from step 2, hard-coded
        # full 60-minute index for this hour
        nidx = pd.date_range(start=hour_start, periods=60, freq='1min')
        # NaN placeholders for every minute of the hour
        ns = pd.DataFrame({'x': np.nan}, index=nidx)
        comb = pd.concat([grp, ns])
        # where a real reading and a NaN placeholder collide, keep the reading
        comb = comb[~comb.index.duplicated(keep='first')]
        comb.sort_index(inplace=True)  # required due to the concat above
        row = comb['x'].tolist()
        row.insert(0, str(hour_start)[:-3])  # 'YYYY-MM-DD HH:MM' row label
        new_df.loc[idx] = row

I expect there will be a lot of data, though, so this will be slow.
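If the row-by-row loop does turn out to be too slow, it can be replaced by a single unstack on an (hour, minute) MultiIndex. A sketch under the same assumptions as above (DatetimeIndex, one column x; the function name is made up):

```python
import numpy as np
import pandas as pd

def reshape_fast(df):
    # label every reading with its (Date-Hour, minute) pair
    s = df["x"].copy()
    s.index = pd.MultiIndex.from_arrays(
        [df.index.floor("h"), df.index.minute], names=["Date-Hour", "minute"]
    )
    wide = s.unstack("minute")                  # rows: hours, columns: minutes
    wide = wide[wide.notna().sum(axis=1) > 48]  # the 80% rule from step 2
    wide = wide.reindex(columns=range(60))      # force all 60 minute columns
    wide.columns = [f"min_{m:02d}" for m in wide.columns]
    return wide
```

The unstack builds every row at once instead of assigning `new_df.loc[idx]` per hour, which avoids the per-row overhead on a year of minute data.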