Group Pandas rows by ID and forward fill them to the right retaining NaN when it appears on all the rows with the same ID

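I have a Pandas DataFrame that I need to group by ID and forward fill to the right, retaining NaN when it appears on all the rows with the same ID.

For each ID categorical value and each metric column (see the aX columns in the example below), there is only one value; when an ID has multiple rows, the other rows hold NaN (np.nan).

Take this as an example: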

In [1]: import numpy as np                                                                                                           

In [2]: import pandas as pd                                                                                                          

In [3]: my_df = pd.DataFrame([ 
   ...:     {"id": 1, "a1": 100.0, "a2": np.nan, "a3": np.nan, "a4": 90.0}, 
   ...:     {"id": 1, "a1": np.nan, "a2": np.nan, "a3": 80.0, "a4": np.nan}, 
   ...:     {"id": 20, "a1": np.nan, "a2": np.nan, "a3": 100.0, "a4": np.nan}, 
   ...:     {"id": 20, "a1": np.nan, "a2": np.nan, "a3": np.nan, "a4": 30.0}, 
   ...: ])                                                                                                                           

In [4]: my_df.head(len(my_df))                                                                                                       
Out[4]: 
   id     a1  a2     a3    a4
0   1  100.0 NaN    NaN  90.0
1   1    NaN NaN   80.0   NaN
2  20    NaN NaN  100.0   NaN
3  20    NaN NaN    NaN  30.0

I have many more columns like a1 to a4.

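I would like to end up with a single row per ID, with the values forward filled to the right, and with NaN kept wherever no value is available. Basically, in the example this means that for ID 1 a2 takes the value 100.0 from a1, while for ID 20 both a1 and a2 stay NaN.

See here: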

In [5]: wanted_df = pd.DataFrame([ 
   ...:     {"id": 1, "a1": 100.0, "a2": 100.0, "a3": 80.0, "a4": 90.0}, 
   ...:     {"id": 20, "a1": np.nan, "a2": np.nan, "a3": 100.0, "a4": 30.0}, 
   ...: ])                                                                                                                           

In [6]: wanted_df.head(len(wanted_df))                                                                                               
Out[6]: 
   id     a1     a2     a3    a4
0   1  100.0  100.0   80.0  90.0
1  20    NaN    NaN  100.0  30.0

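The forward fill to the right should propagate across multiple columns of the same row, not only to the adjacent column.

When I use my_df.interpolate(method='pad', axis=1,limit=None,limit_direction='forward',limit_area=None,downcast=None,), I still get multiple rows for the same ID.

When I use my_df.groupby("id").sum(), I see 0.0 everywhere instead of the NaN values being retained in the scenarios defined above.

When I use my_df.groupby("id").apply(np.sum), the ID column is summed as well, which is wrong because it should be preserved.

How can I do this?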

One idea is to use sum with min_count=1:

df = my_df.groupby("id").sum(min_count=1)
print (df)
       a1  a2     a3    a4
id                        
1   100.0 NaN   80.0  90.0
20    NaN NaN  100.0  30.0
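
Why min_count matters can be checked on a single all-NaN Series. This is only a minimal sketch; the variable names are made up for illustration:

```python
import numpy as np
import pandas as pd

# An all-NaN group, like column a2 for either id in the example.
all_nan = pd.Series([np.nan, np.nan])

# By default NaN is skipped, so the sum over "nothing" is 0.0.
default_sum = all_nan.sum()

# min_count=1 requires at least one non-NaN value; otherwise the result is NaN.
guarded_sum = all_nan.sum(min_count=1)

print(default_sum, guarded_sum)
```

The same rule is applied per group and per column by groupby("id").sum(min_count=1), which is why the all-NaN cells survive.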

Or, if you need the first non-missing value per group, it is possible to use GroupBy.first:

df = my_df.groupby("id").first()
print (df)
       a1  a2     a3    a4
id                        
1   100.0 NaN   80.0  90.0
20    NaN NaN  100.0  30.0
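
Neither result above fills a2 for ID 1 with 100.0, which wanted_df in the question asks for. A possible sketch for that last step, assuming the one-row-per-ID frame from GroupBy.first and using DataFrame.ffill along axis=1; the names collapsed and filled are only illustrative:

```python
import numpy as np
import pandas as pd

my_df = pd.DataFrame([
    {"id": 1, "a1": 100.0, "a2": np.nan, "a3": np.nan, "a4": 90.0},
    {"id": 1, "a1": np.nan, "a2": np.nan, "a3": 80.0, "a4": np.nan},
    {"id": 20, "a1": np.nan, "a2": np.nan, "a3": 100.0, "a4": np.nan},
    {"id": 20, "a1": np.nan, "a2": np.nan, "a3": np.nan, "a4": 30.0},
])

# One row per id, keeping the first non-missing value of each column.
collapsed = my_df.groupby("id").first()

# Forward fill along the columns. "id" is still the index at this point,
# so it is never used as a fill value; leading NaN cells stay NaN.
filled = collapsed.ffill(axis=1).reset_index()
print(filled)
```

Keeping id in the index until after the fill is the design choice here: with id as a regular column, a row starting with NaN in a1 would be filled with the ID itself.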

It gets more complicated if there are multiple non-missing values per group and all of them are needed:

# added 20 to a1
my_df = pd.DataFrame([
    {"id": 1, "a1": 100.0, "a2": np.nan, "a3": np.nan, "a4": 90.0},
    {"id": 1, "a1": 20, "a2": np.nan, "a3": 80.0, "a4": np.nan},
    {"id": 20, "a1": np.nan, "a2": np.nan, "a3": 100.0, "a4": np.nan},
    {"id": 20, "a1": np.nan, "a2": np.nan, "a3": np.nan, "a4": 30.0},
])
print (my_df)
   id     a1  a2     a3    a4
0   1  100.0 NaN    NaN  90.0
1   1   20.0 NaN   80.0   NaN
2  20    NaN NaN  100.0   NaN
3  20    NaN NaN    NaN  30.0

def f(x):
    return x.apply(lambda x: pd.Series(x.dropna().to_numpy()))

df1 = (my_df.set_index('id')
            .groupby("id")
            .apply(f)
            .reset_index(level=1, drop=True)
            .reset_index())
print (df1)

   id     a1  a2     a3    a4
0   1  100.0 NaN   80.0  90.0
1   1   20.0 NaN    NaN   NaN
2  20    NaN NaN  100.0  30.0

The first and second solutions behave differently on this data:

df2 = my_df.groupby("id").sum(min_count=1)
print (df2)
       a1  a2     a3    a4
id                        
1   120.0 NaN   80.0  90.0
20    NaN NaN  100.0  30.0

df3 = my_df.groupby("id").first()
print (df3)
       a1  a2     a3    a4
id                        
1   100.0 NaN   80.0  90.0
20    NaN NaN  100.0  30.0

If all values have the same type (here numbers), it is also possible to use:

#
def justify(a, invalid_val=0, axis=1, side='left'):    
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    invalid_val : scalar
        Value that marks a missing entry (for example np.nan)
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.

    """

    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a!=invalid_val
    justified_mask = np.sort(mask,axis=axis)
    if (side=='up') | (side=='left'):
        justified_mask = np.flip(justified_mask,axis=axis)
    out = np.full(a.shape, invalid_val) 
    if axis==1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out

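# Push the non-NaN values of each column to the top of the group,
# then drop the rows that end up all-NaN.
# The outer parentheses are needed so that .dropna() can sit on its own line.
f = lambda x: (pd.DataFrame(justify(x.to_numpy(),
                                    invalid_val=np.nan,
                                    axis=0,
                                    side='up'), columns=my_df.columns.drop('id'))
                 .dropna(how='all'))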
df1 = (my_df.set_index('id')
            .groupby("id")
            .apply(f)
            .reset_index(level=1, drop=True)
            .reset_index())
print (df1)
   id     a1  a2     a3    a4
0   1  100.0 NaN   80.0  90.0
1   1   20.0 NaN    NaN   NaN
2  20    NaN NaN  100.0  30.0
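
A different route to the same table, not taken from the solutions above, is to reshape to long form, number the non-missing values within each (id, column) pair, and pivot back. This is only a sketch, assuming pandas 1.1 or later (for a list-valued index in pivot); the names tidy, slot and out are hypothetical:

```python
import numpy as np
import pandas as pd

my_df = pd.DataFrame([
    {"id": 1, "a1": 100.0, "a2": np.nan, "a3": np.nan, "a4": 90.0},
    {"id": 1, "a1": 20, "a2": np.nan, "a3": 80.0, "a4": np.nan},
    {"id": 20, "a1": np.nan, "a2": np.nan, "a3": 100.0, "a4": np.nan},
    {"id": 20, "a1": np.nan, "a2": np.nan, "a3": np.nan, "a4": 30.0},
])
metric_cols = my_df.columns.drop("id")

# Long form: one row per (id, column, value), missing values removed.
tidy = my_df.melt(id_vars="id").dropna(subset=["value"])

# Position of each value among the non-missing values of its id and column.
tidy["slot"] = tidy.groupby(["id", "variable"]).cumcount()

# Back to wide form: one output row per (id, slot).
out = (tidy.pivot(index=["id", "slot"], columns="variable", values="value")
           .reindex(columns=metric_cols)   # restores all-NaN columns such as a2
           .rename_axis(columns=None)
           .reset_index(level="slot", drop=True)
           .reset_index())
print(out)
```

Unlike justify, this does not require all columns to share one dtype in a single NumPy array, at the cost of two reshapes.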