Imputing row values in a Pandas DataFrame based on the values of other rows selected by a different column
I have the following df:
import numpy as np
import pandas as pd

df = pd.DataFrame({
'Market': {0: 'Zone1',
1: 'Zone1',
2: 'Zone1',
3: 'Zone1',
4: 'Zone2',
5: 'Zone2',
6: 'Zone2',
7: 'Zone2'},
'col1': {0: 'v1',
1: 'v2',
2: 'v3',
3: 'v4',
4: 'v1',
5: 'v2',
6: 'v3',
7: 'v4'},
'col2': {0: np.nan,
1: 1,
2: 6,
3: 2,
4: np.nan,
5: 2,
6: 1,
7: 2,},
'col3': {0: np.nan,
1: 9,
2: 5,
3: 2,
4: np.nan,
5: 0,
6: 9,
7: 1,}})
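For each value of Market (i.e. Zone1 and Zone2), I want to replace the NaN values in the row where col1 is v1 with the sum of the values in the rows where col1 is v2 and v4. So the output would look like this: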
Market col1 col2 col3
-----------------------------------
0 | Zone1 v1 3 11
1 | Zone1 v2 1 9
2 | Zone1 v3 6 5
3 | Zone1 v4 2 2
4 | Zone2 v1 4 1
5 | Zone2 v2 2 0
6 | Zone2 v3 1 9
7 | Zone2 v4 2 1
We can use a simple for loop:
for x in df.Market.unique():
    # the v1 row of this market gets the column-wise sum of its v2 and v4 rows
    df.loc[df.Market.eq(x) & df.col1.eq('v1'), ['col2', 'col3']] = \
        df.loc[df.Market.eq(x) & df.col1.isin(['v2', 'v4']), ['col2', 'col3']].sum().values
df
Out[69]:
Market col1 col2 col3
0 Zone1 v1 3.0 11.0
1 Zone1 v2 1.0 9.0
2 Zone1 v3 6.0 5.0
3 Zone1 v4 2.0 2.0
4 Zone2 v1 4.0 1.0
5 Zone2 v2 2.0 0.0
6 Zone2 v3 1.0 9.0
7 Zone2 v4 2.0 1.0
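Another option, using groupby:
value_cols = ['col2', 'col3']
# Per-market sums of the v2 and v4 rows. .values assigns by position, so this
# assumes the v1 rows appear in the same (sorted) Market order as the groups.
df.loc[df.col1.eq('v1'), value_cols] = (
    df[df.col1.isin(['v2', 'v4'])].groupby('Market')[value_cols].sum().values
)
df[value_cols] = df[value_cols].astype(int)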
print(df)
Output:
Market col1 col2 col3
0 Zone1 v1 3 11
1 Zone1 v2 1 9
2 Zone1 v3 6 5
3 Zone1 v4 2 2
4 Zone2 v1 4 1
5 Zone2 v2 2 0
6 Zone2 v3 1 9
7 Zone2 v4 2 1
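Because the groupby option assigns the sums by position, it depends on the v1 rows being ordered the same way as the sorted Market groups. As a hedged sketch (not part of the original answers), the sums can instead be matched to each v1 row by its Market label with `Series.map`, which should not depend on row order. The names `sums`, `target` and `fill` are illustrative only.

```python
import numpy as np
import pandas as pd

# Same data as in the question, written with lists for brevity.
df = pd.DataFrame({
    'Market': ['Zone1'] * 4 + ['Zone2'] * 4,
    'col1': ['v1', 'v2', 'v3', 'v4'] * 2,
    'col2': [np.nan, 1, 6, 2, np.nan, 2, 1, 2],
    'col3': [np.nan, 9, 5, 2, np.nan, 0, 9, 1],
})
value_cols = ['col2', 'col3']

# Per-market sums of the v2 and v4 rows, indexed by Market.
sums = df[df['col1'].isin(['v2', 'v4'])].groupby('Market')[value_cols].sum()

# Rows to impute: col1 == 'v1'.
target = df['col1'].eq('v1')

for col in value_cols:
    # Look up each target row's Market in the sums; the result keeps df's index.
    fill = df.loc[target, 'Market'].map(sums[col])
    # Only NaN entries are replaced; existing values in v1 rows are kept.
    df.loc[target, col] = df.loc[target, col].fillna(fill)

print(df)
```

The value columns stay float here; `df[value_cols].astype(int)` can be applied afterwards if no NaN values remain.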
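Another approach fills only one row at a time:
summary = df.query('col1 == "v2" or col1 == "v4"').groupby('Market')[['col2', 'col3']].sum()
for ind, row in summary.iterrows():
    # limit=1 fills only the first remaining NaN in each column, so this relies on
    # the NaN rows appearing in the same order as the markets in summary
    # df.fillna({'col2': row['col2'], 'col3': row['col3']}, inplace=True, limit=1)  # in case of memory issues
    df = df.fillna({'col2': row['col2'], 'col3': row['col3']}, limit=1)
df.head(10)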