Find out how long a run of consecutive NaNs in a column is

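I need to clean data (efficiently) according to this specific rule:

If a column contains a run of 3 or fewer consecutive NaNs, fill that NaN "chain" in the df column with .fillna(method='ffill'). Otherwise leave the run as it is (it will be handled by another method).

Example: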

import numpy as np
import pandas as pd

df = pd.DataFrame({"A":[8001, 7999, 7998, np.nan, 9900, 9342, 9324, 8534, 8358, 9457, np.nan, 8999, 8492, np.nan, np.nan],
                   "B":[201, 209, 298, 300,np.nan, 342, 324, 854, 858, 457, 145, 189, 192, 134, 135],
                   "C":[11991, 15631, 47998, 38030, 19900, 29342, np.nan, np.nan, np.nan,np.nan, 27245, 28999, 28492, 29334, 28234]}, 
                   index=pd.Index(['2019-06-17 00:00:00','2019-06-17 00:01:01', '2019-06-17 00:02:00', '2019-06-17 00:03:04', 
                                   '2020-06-17 00:04:00', '2020-06-17 00:05:00', '2020-06-17 00:06:00', '2020-06-17 00:07:00',
                                   '2020-06-17 00:08:00','2020-06-17 00:09:00','2020-06-17 00:10:00','2020-06-17 00:11:00',
                                   '2020-06-17 00:12:00','2020-06-17 00:13:00', '2020-06-17 00:14:00']))

df

                 Time     A     B       C
'2019-06-17 00:00:00'  8001   201   11991
'2019-06-17 00:01:01'  7999   209   15631
'2019-06-17 00:02:00'  7998   298   47998
'2019-06-17 00:03:04'  NaN    300   38030
'2020-06-17 00:04:00'  9900   NaN   19900
'2020-06-17 00:05:00'  9342   342   29342
'2020-06-17 00:06:00'  9324   324     NaN
'2020-06-17 00:07:00'  8534   854     NaN
'2020-06-17 00:08:00'  8358   858     NaN
'2020-06-17 00:09:00'  9457   457     NaN
'2020-06-17 00:10:00'   NaN   145   27245
'2020-06-17 00:11:00'  8999   189   28999
'2020-06-17 00:12:00'  8492   192   28492
'2020-06-17 00:13:00'   NaN   134   29334
'2020-06-17 00:14:00'   NaN   135   28234

Expected result:

                 Time     A     B       C
'2019-06-17 00:00:00'  8001   201   11991
'2019-06-17 00:01:01'  7999   209   15631
'2019-06-17 00:02:00'  7998   298   47998
'2019-06-17 00:03:04'  7998   300   38030
'2020-06-17 00:04:00'  9900   300   19900
'2020-06-17 00:05:00'  9342   342   29342
'2020-06-17 00:06:00'  9324   324     NaN
'2020-06-17 00:07:00'  8534   854     NaN
'2020-06-17 00:08:00'  8358   858     NaN
'2020-06-17 00:09:00'  9457   457     NaN
'2020-06-17 00:10:00'  9457   145   27245
'2020-06-17 00:11:00'  8999   189   28999
'2020-06-17 00:12:00'  8492   192   28492
'2020-06-17 00:13:00'  8492   134   29334
'2020-06-17 00:14:00'  8492   135   28234

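Just determine the size of each group of consecutive NaNs and find which groups are no larger than the maximum gap size. Then use that Boolean Series to mask the fully forward-filled column, so that only the gaps whose size is less than or equal to the gap size you specify get filled.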

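def fwd_fill_gaps(df, col, gap_max):
    """Forward-fill runs of consecutive NaN whose length is <= gap_max."""

    # Label each NaN run: the running count of non-null values is constant
    # within a run; non-NaN rows are masked out (set to NaN).
    s = df[col].notnull().cumsum().where(df[col].isnull())
    # Only True for NaN gaps of size <= gap_max
    s = s.groupby(s).transform('size').le(gap_max)

    # downcast='infer' restores an integer dtype when no NaN remain; note that
    # this keyword is deprecated in recent pandas releases.
    return df[col].fillna(df[col].ffill().where(s), downcast='infer')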


for col in ['A', 'B', 'C']:
    df[col] = fwd_fill_gaps(df, col, gap_max=3)

                        A    B        C
2019-06-17 00:00:00  8001  201  11991.0
2019-06-17 00:01:01  7999  209  15631.0
2019-06-17 00:02:00  7998  298  47998.0
2019-06-17 00:03:04  7998  300  38030.0
2020-06-17 00:04:00  9900  300  19900.0
2020-06-17 00:05:00  9342  342  29342.0
2020-06-17 00:06:00  9324  324      NaN
2020-06-17 00:07:00  8534  854      NaN
2020-06-17 00:08:00  8358  858      NaN
2020-06-17 00:09:00  9457  457      NaN
2020-06-17 00:10:00  9457  145  27245.0
2020-06-17 00:11:00  8999  189  28999.0
2020-06-17 00:12:00  8492  192  28492.0
2020-06-17 00:13:00  8492  134  29334.0
2020-06-17 00:14:00  8492  135  28234.0
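
To make the intermediate steps easier to follow, here is a minimal sketch on a toy Series. The data and the names run_id, run_size, small_gap, filled and limited are made up for illustration and are not part of the answer above. It also contrasts the result with ffill(limit=...), which, as far as I can tell, is not equivalent because it partially fills long gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one gap of length 2 and one gap of length 4.
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, np.nan, np.nan, np.nan, 9.0])
gap_max = 3

# Step 1: label each NaN run. The running count of non-null values stays
# constant inside a run; where() blanks out the non-NaN rows.
run_id = s.notna().cumsum().where(s.isna())

# Step 2: for every NaN, the length of the run it belongs to
# (rows that are not NaN in s get NaN here).
run_size = run_id.groupby(run_id).transform("size")

# Step 3: True only for NaNs that sit in a run of length <= gap_max.
small_gap = run_size.le(gap_max)

# Step 4: take forward-filled values only where the mask is True.
filled = s.fillna(s.ffill().where(small_gap))

# For contrast: limit= fills the first gap_max NaNs of every gap,
# so the long gap ends up partially filled instead of untouched.
limited = s.ffill(limit=gap_max)

print(pd.DataFrame({"s": s, "run_id": run_id, "run_size": run_size,
                    "small_gap": small_gap, "filled": filled,
                    "limited": limited}))
```

In this sketch the gap of length 2 is filled with 1.0 while the gap of length 4 is left entirely as NaN in filled, whereas limited fills three of its four positions.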