Find the length of each run of consecutive NaNs in a column
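I need to clean my data (efficiently) according to this specific rule:
If a column has a run of 3 or fewer consecutive NaNs, fill that NaN "chain" in the df column via .fillna(method='ffill').
Otherwise leave the run as it is (it will be handled by another method).
Example: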
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [8001, 7999, 7998, np.nan, 9900, 9342, 9324, 8534, 8358, 9457, np.nan, 8999, 8492, np.nan, np.nan],
                   "B": [201, 209, 298, 300, np.nan, 342, 324, 854, 858, 457, 145, 189, 192, 134, 135],
                   "C": [11991, 15631, 47998, 38030, 19900, 29342, np.nan, np.nan, np.nan, np.nan, 27245, 28999, 28492, 29334, 28234]},
                  index=pd.Index(['2019-06-17 00:00:00', '2019-06-17 00:01:01', '2019-06-17 00:02:00', '2019-06-17 00:03:04',
                                  '2020-06-17 00:04:00', '2020-06-17 00:05:00', '2020-06-17 00:06:00', '2020-06-17 00:07:00',
                                  '2020-06-17 00:08:00', '2020-06-17 00:09:00', '2020-06-17 00:10:00', '2020-06-17 00:11:00',
                                  '2020-06-17 00:12:00', '2020-06-17 00:13:00', '2020-06-17 00:14:00']))
df
Time A B C
'2019-06-17 00:00:00' 8001 201 11991
'2019-06-17 00:01:01' 7999 209 15631
'2019-06-17 00:02:00' 7998 298 47998
'2019-06-17 00:03:04' NaN 300 38030
'2020-06-17 00:04:00' 9900 NaN 19900
'2020-06-17 00:05:00' 9342 342 29342
'2020-06-17 00:06:00' 9324 324 NaN
'2020-06-17 00:07:00' 8534 854 NaN
'2020-06-17 00:08:00' 8358 858 NaN
'2020-06-17 00:09:00' 9457 457 NaN
'2020-06-17 00:10:00' NaN 145 27245
'2020-06-17 00:11:00' 8999 189 28999
'2020-06-17 00:12:00' 8492 192 28492
'2020-06-17 00:13:00' NaN 134 29334
'2020-06-17 00:14:00' NaN 135 28234
Expected result:
Time A B C
'2019-06-17 00:00:00' 8001 201 11991
'2019-06-17 00:01:01' 7999 209 15631
'2019-06-17 00:02:00' 7998 298 47998
'2019-06-17 00:03:04' 7998 300 38030
'2020-06-17 00:04:00' 9900 300 19900
'2020-06-17 00:05:00' 9342 342 29342
'2020-06-17 00:06:00' 9324 324 NaN
'2020-06-17 00:07:00' 8534 854 NaN
'2020-06-17 00:08:00' 8358 858 NaN
'2020-06-17 00:09:00' 9457 457 NaN
'2020-06-17 00:10:00' 9457 145 27245
'2020-06-17 00:11:00' 8999 189 28999
'2020-06-17 00:12:00' 8492 192 28492
'2020-06-17 00:13:00' 8492 134 29334
'2020-06-17 00:14:00' 8492 135 28234
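Determine the size of each group of consecutive NaNs and find which groups are no larger than the maximum gap size. Then use that Boolean Series to mask the fully forward-filled column, so that only gaps whose size is less than or equal to the gap size you specify get filled.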
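def fwd_fill_gaps(df, col, gap_max):
    """Fill consecutive NaN when the run size is <= gap_max."""
    # Label each NaN with the number of non-null values seen so far, so every
    # NaN in one consecutive run shares a label; non-NaN rows get no label.
    s = df[col].notnull().cumsum().where(df[col].isnull())
    # Only True for NaN gaps of size <= gap_max
    s = s.groupby(s).transform('size').le(gap_max)
    # Forward-fill everything, but keep the filled values only inside short gaps.
    # downcast='infer' restores an integer dtype when no NaN remains; the keyword
    # is deprecated in recent pandas releases.
    return df[col].fillna(df[col].ffill().where(s), downcast='infer')

for col in ['A', 'B', 'C']:
    df[col] = fwd_fill_gaps(df, col, gap_max=3)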
A B C
2019-06-17 00:00:00 8001 201 11991.0
2019-06-17 00:01:01 7999 209 15631.0
2019-06-17 00:02:00 7998 298 47998.0
2019-06-17 00:03:04 7998 300 38030.0
2020-06-17 00:04:00 9900 300 19900.0
2020-06-17 00:05:00 9342 342 29342.0
2020-06-17 00:06:00 9324 324 NaN
2020-06-17 00:07:00 8534 854 NaN
2020-06-17 00:08:00 8358 858 NaN
2020-06-17 00:09:00 9457 457 NaN
2020-06-17 00:10:00 9457 145 27245.0
2020-06-17 00:11:00 8999 189 28999.0
2020-06-17 00:12:00 8492 192 28492.0
2020-06-17 00:13:00 8492 134 29334.0
2020-06-17 00:14:00 8492 135 28234.0
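To make the grouping step easier to inspect, here is a minimal sketch that first computes the run length of every NaN explicitly and then applies the same masking idea to the whole frame at once. The helper names nan_run_lengths and ffill_short_gaps are hypothetical, not part of pandas. The sketch avoids the downcast keyword, so filled columns stay float. It assumes a unique index, as in the example above.

```python
import numpy as np
import pandas as pd


def nan_run_lengths(s: pd.Series) -> pd.Series:
    """Length of the NaN run each element belongs to (0 for non-NaN)."""
    is_nan = s.isna()
    # Every non-NaN value starts a new block, so all NaNs in one
    # consecutive run end up with the same block id.
    block = s.notna().cumsum()
    # Count the NaNs inside each block and broadcast the count to its rows.
    counts = is_nan.astype(int).groupby(block).transform("sum")
    # Report the count only at NaN positions; non-NaN rows get 0.
    return counts.where(is_nan, 0).astype(int)


def ffill_short_gaps(df: pd.DataFrame, gap_max: int) -> pd.DataFrame:
    """Forward-fill only NaN runs of length <= gap_max, column by column."""
    run_len = df.apply(nan_run_lengths)
    fillable = df.isna() & run_len.le(gap_max)
    # Keep the original values except where a short gap may be filled.
    return df.where(~fillable, df.ffill())


s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, np.nan, np.nan, 8.0, np.nan, np.nan])
print(nan_run_lengths(s).tolist())
# [0, 1, 0, 4, 4, 4, 4, 0, 2, 2]
print(ffill_short_gaps(pd.DataFrame({"x": s}), gap_max=3))
```

With gap_max=3, the run of four NaNs is left untouched while the runs of one and two NaNs are filled from the preceding value. A run at the very start of a column has nothing to fill from, so it stays NaN regardless of its length.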