Imputing row values in a Pandas DataFrame based on specific row values in a different column

I have the following df:

import numpy as np
import pandas as pd

df = pd.DataFrame({
 'Market': {0: 'Zone1',
  1: 'Zone1',
  2: 'Zone1',
  3: 'Zone1',
  4: 'Zone2',
  5: 'Zone2',
  6: 'Zone2',
  7: 'Zone2'},
  'col1': {0: 'v1',
  1: 'v2',
  2: 'v3',
  3: 'v4',
  4: 'v1',
  5: 'v2',
  6: 'v3',
  7: 'v4'},
 'col2': {0: np.nan,
  1: 1,
  2: 6,
  3: 2,
  4: np.nan,
  5: 2,
  6: 1,
  7: 2,},
 'col3': {0: np.nan,
  1: 9,
  2: 5,
  3: 2,
  4: np.nan,
  5: 0,
  6: 9,
  7: 1,}})

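For each value of Market (i.e. Zone1 and Zone2), I want to replace the NaN values in the row where col1 is v1 with the sum of the values in the rows where col1 is v2 or v4. So the output would look like this: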

        Market col1 col2 col3   
-----------------------------------
0     | Zone1   v1   3   11    
1     | Zone1   v2   1   9     
2     | Zone1   v3   6   5    
3     | Zone1   v4   2   2     
4     | Zone2   v1   4   1
5     | Zone2   v2   2   0     
6     | Zone2   v3   1   9
7     | Zone2   v4   2   1  

We can do a simple for loop:

for x in df.Market.unique():
    # The v1 row of this market receives the column sums of its v2 and v4 rows
    df.loc[df.Market.eq(x) & df.col1.eq('v1'), ['col2', 'col3']] = \
        df.loc[df.Market.eq(x) & df.col1.isin(['v2', 'v4']), ['col2', 'col3']].sum().values
        
        
df
Out[69]: 
  Market col1  col2  col3
0  Zone1   v1   3.0  11.0
1  Zone1   v2   1.0   9.0
2  Zone1   v3   6.0   5.0
3  Zone1   v4   2.0   2.0
4  Zone2   v1   4.0   1.0
5  Zone2   v2   2.0   0.0
6  Zone2   v3   1.0   9.0
7  Zone2   v4   2.0   1.0

Another option using groupby:

value_cols = ['col2', 'col3']

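# The assignment is positional: it relies on the v1 rows appearing in the
# same (sorted) Market order as the groupby result.
df.loc[
    df.col1.eq('v1'),
    value_cols
] = df[df.col1.isin(['v2', 'v4'])].groupby('Market')[value_cols].sum().values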

df[value_cols] = df[value_cols].astype(int)
print(df)

Output:

  Market col1  col2  col3
0  Zone1   v1     3    11
1  Zone1   v2     1     9
2  Zone1   v3     6     5
3  Zone1   v4     2     2
4  Zone2   v1     4     1
5  Zone2   v2     2     0
6  Zone2   v3     1     9
7  Zone2   v4     2     1
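
If the rows are not guaranteed to be sorted by Market, a label-aligned variant may be safer than assigning raw `.values`. The following is only a sketch under that assumption: it rebuilds the sample df in a compact form, and the names `sums` and `mask` are introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

# Same data as in the question, written compactly
df = pd.DataFrame({
    'Market': ['Zone1'] * 4 + ['Zone2'] * 4,
    'col1': ['v1', 'v2', 'v3', 'v4'] * 2,
    'col2': [np.nan, 1, 6, 2, np.nan, 2, 1, 2],
    'col3': [np.nan, 9, 5, 2, np.nan, 0, 9, 1],
})
value_cols = ['col2', 'col3']

# Per-market sums of the v2 and v4 rows, indexed by Market
sums = df[df['col1'].isin(['v2', 'v4'])].groupby('Market')[value_cols].sum()

# Look up each v1 row's sum by its Market label instead of by position
mask = df['col1'].eq('v1')
for col in value_cols:
    df.loc[mask, col] = df.loc[mask, 'Market'].map(sums[col])

print(df)
```

Because the lookup goes through the Market label, the result should not depend on the order in which the markets appear in df.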

Another approach, which changes only one row at a time:

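# Select the value columns explicitly so the string column col1 is not summed
summary = df.query('col1 == "v2" or col1 == "v4"').groupby('Market')[['col2', 'col3']].sum()
for ind, row in summary.iterrows():
    # limit=1 fills only the first remaining NaN in each column, so this
    # relies on the NaN rows appearing in the same Market order as summary.
    # In case of memory issues, the in-place form can be used instead:
    # df.fillna({'col2': row['col2'], 'col3': row['col3']}, inplace=True, limit=1)
    df = df.fillna({'col2': row['col2'], 'col3': row['col3']}, limit=1)
df.head(10)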
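
The same fillna idea can also be sketched without the loop and without `limit`, by building a frame of fill values that shares df's index and passing it to `fillna`. This is a rough sketch, not a drop-in replacement: it rebuilds the sample df, and `sums` and `fill` are illustrative names.

```python
import numpy as np
import pandas as pd

# Same data as in the question, written compactly
df = pd.DataFrame({
    'Market': ['Zone1'] * 4 + ['Zone2'] * 4,
    'col1': ['v1', 'v2', 'v3', 'v4'] * 2,
    'col2': [np.nan, 1, 6, 2, np.nan, 2, 1, 2],
    'col3': [np.nan, 9, 5, 2, np.nan, 0, 9, 1],
})
value_cols = ['col2', 'col3']

# Per-market sums of the v2 and v4 rows
sums = df[df['col1'].isin(['v2', 'v4'])].groupby('Market')[value_cols].sum()

# One row of sums per original row, re-labelled with df's own index
fill = sums.reindex(df['Market'])
fill.index = df.index

# Keep fill values only for the v1 rows, then fill NaN cells by index alignment
fill = fill[df['col1'].eq('v1')]
df[value_cols] = df[value_cols].fillna(fill)

print(df)
```

Only cells that are NaN and belong to a v1 row are touched, so existing values and NaNs in other rows are left alone.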