A new column in pandas whose value depends on other columns
I have some sample data:
datetime             col1  col2  col3
2021-04-10 01:00:00    25    50    50
2021-04-10 02:00:00    25    50    50
2021-04-10 03:00:00    25   100    50
2021-04-10 04:00:00    50    50   100
2021-04-10 05:00:00   100   100   100
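I want to create a new column called state that returns the col1 value if both col2 and col3 are less than or equal to 50, and otherwise returns the maximum of col1, col2 and col3.
The expected output is shown below: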
datetime             col1  col2  col3  state
2021-04-10 01:00:00    25    50    50     25
2021-04-10 02:00:00    25    50    50     25
2021-04-10 03:00:00    25   100    50    100
2021-04-10 04:00:00    50    50   100    100
2021-04-10 05:00:00   100   100   100    100
You can iterate over the rows of the DataFrame and check the condition:
values = []
for ind, row in df.iterrows():
    # Use `and` for scalars: `&` binds tighter than `<=` and would change the condition
    if row['col2'] <= 50 and row['col3'] <= 50:
        values.append(row['col1'])
    else:
        values.append(max(row['col1'], row['col2'], row['col3']))
df['state'] = values
print(df)
datetime col1 col2 col3 state
2021-04-10 01:00:00 25 50 50 25
2021-04-10 02:00:00 25 50 50 25
2021-04-10 03:00:00 25 100 50 100
2021-04-10 04:00:00 50 50 100 100
2021-04-10 05:00:00 100 100 100 100
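A note on row-wise conditions like the one in this loop: in Python, `&` binds more tightly than `<=`, so `a <= 50 & b <= 50` is parsed as the chained comparison `a <= (50 & b) <= 50`, which is not the same test as `a <= 50 and b <= 50`. A minimal sketch with plain integers (the values 10 and 100 are made up for illustration and are not from the sample data):

```python
a, b = 10, 100  # hypothetical values chosen so the two forms disagree

# Parsed as: a <= (50 & b) <= 50, and 50 & 100 == 32
with_ampersand = a <= 50 & b <= 50

# Plain boolean logic on two scalar comparisons
with_and = a <= 50 and b <= 50

print(with_ampersand)  # True
print(with_and)        # False
```

With float values the `&` form fails outright, because bitwise AND is not defined between an int and a float. On whole pandas columns `&` is the right operator, but each comparison then needs its own parentheses.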
Create a mask:
# Create a mask for the basic condition
mask1 = (df['col2'] <= 50) & (df['col3'] <= 50)
# Use loc to select the rows where the condition is met and put the df['col1'] value in state
df.loc[mask1, 'state'] = df['col1']
# ~mask1 selects the rows where the condition is not met; put the row-wise maximum in state
df.loc[~mask1, 'state'] = df[['col1', 'col2', 'col3']].max(axis=1)
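For reference, here is a self-contained sketch of the mask idea run against the sample data from the question. It assumes the fallback should be the row-wise maximum, as the question asks. The variable names `cols` and `mask` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2021-04-10 01:00:00", "2021-04-10 02:00:00", "2021-04-10 03:00:00",
        "2021-04-10 04:00:00", "2021-04-10 05:00:00",
    ]),
    "col1": [25, 25, 25, 50, 100],
    "col2": [50, 50, 100, 50, 100],
    "col3": [50, 50, 50, 100, 100],
})

cols = ["col1", "col2", "col3"]
mask = (df["col2"] <= 50) & (df["col3"] <= 50)

# Start from the row-wise maximum, then overwrite the rows that meet the condition
df["state"] = df[cols].max(axis=1)
df.loc[mask, "state"] = df.loc[mask, "col1"]

print(df["state"].tolist())  # [25, 25, 100, 100, 100]
```

Assigning the maximum first means state is fully populated from the start, rather than being created with missing values by the first .loc assignment.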
To improve on the other answers, I would use pandas apply to go through the rows and compute the new column.
def calc_new_col(row):
    # Use `and` for scalars: `&` binds tighter than `<=` and would change the condition
    if row['col2'] <= 50 and row['col3'] <= 50:
        return row['col1']
    else:
        return max(row['col1'], row['col2'], row['col3'])

df["state"] = df.apply(calc_new_col, axis=1)
# axis=1 makes sure that the function is applied to each row
print(df)
datetime col1 col2 col3 state
2021-04-10 01:00:00 25 50 50 25
2021-04-10 02:00:00 25 50 50 25
2021-04-10 03:00:00 25 100 50 100
2021-04-10 04:00:00 50 50 100 100
2021-04-10 05:00:00 100 100 100 100
Using apply helps make the code cleaner and more reusable.
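As a rough illustration of the reusability point, the row function can take the threshold as a parameter, since DataFrame.apply forwards extra keyword arguments to the function. The helper name `state_for_row`, its `threshold` and `cols` parameters, and the `state_100` column are assumptions made for this sketch. The datetime column is left out for brevity.

```python
import pandas as pd

def state_for_row(row, threshold=50, cols=("col1", "col2", "col3")):
    """Hypothetical helper: col1 if col2 and col3 are within threshold, else the row maximum."""
    if row["col2"] <= threshold and row["col3"] <= threshold:
        return row["col1"]
    return max(row[c] for c in cols)

df = pd.DataFrame({
    "col1": [25, 25, 25, 50, 100],
    "col2": [50, 50, 100, 50, 100],
    "col3": [50, 50, 50, 100, 100],
})

# Default threshold of 50, as in the question
df["state"] = df.apply(state_for_row, axis=1)

# Same function with a different threshold, passed through apply as a keyword argument
df["state_100"] = df.apply(state_for_row, axis=1, threshold=100)

print(df)
```

With a threshold of 100 every row meets the condition, so state_100 simply repeats col1.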
An option using np.where:
import numpy as np
import pandas as pd

df = pd.DataFrame({'datetime': {0: '2021-04-10 01:00:00', 1: '2021-04-10 02:00:00',
                                2: '2021-04-10 03:00:00', 3: '2021-04-10 04:00:00',
                                4: '2021-04-10 05:00:00'},
                   'col1': {0: 25.0, 1: 25.0, 2: 25.0, 3: 50.0, 4: 100.0},
                   'col2': {0: 50.0, 1: 50.0, 2: 100.0, 3: 50.0, 4: 100.0},
                   'col3': {0: 50, 1: 50, 2: 50, 3: 100, 4: 100}})

# Take the maximum over the numeric columns only, so the datetime strings are not compared
df['state'] = np.where((df['col2'] <= 50) & (df['col3'] <= 50),
                       df['col1'],
                       df[['col1', 'col2', 'col3']].max(axis=1))
print(df)
Output:
datetime col1 col2 col3 state
2021-04-10 01:00:00 25.0 50.0 50 25.0
2021-04-10 02:00:00 25.0 50.0 50 25.0
2021-04-10 03:00:00 25.0 100.0 50 100.0
2021-04-10 04:00:00 50.0 50.0 100 100.0
2021-04-10 05:00:00 100.0 100.0 100 100.0