A new column in pandas whose value depends on other columns

Question

I have some sample data:

datetime               col1    col2    col3
2021-04-10 01:00:00      25      50      50
2021-04-10 02:00:00      25      50      50
2021-04-10 03:00:00      25     100      50
2021-04-10 04:00:00      50      50     100
2021-04-10 05:00:00     100     100     100

I want to create a new column called state that returns the col1 value if the values of col2 and col3 are both less than or equal to 50, and otherwise returns the maximum of col1, col2 and col3.

The expected output looks like this:

datetime               col1    col2    col3    state
2021-04-10 01:00:00      25      50      50       25
2021-04-10 02:00:00      25      50      50       25
2021-04-10 03:00:00      25     100      50      100
2021-04-10 04:00:00      50      50     100      100
2021-04-10 05:00:00     100     100     100      100

Answer 1

You can iterate over the rows of the DataFrame and check the condition:

values = []

for ind, row in df.iterrows():
    # Use `and` on scalar values: `&` binds more tightly than `<=`
    if row['col2'] <= 50 and row['col3'] <= 50:
        values.append(row['col1'])
    else:
        values.append(max(row['col1'], row['col2'], row['col3']))

df['state'] = values

print(df)
            datetime  col1  col2  col3  state
2021-04-10  01:00:00    25    50    50     25
2021-04-10  02:00:00    25    50    50     25
2021-04-10  03:00:00    25   100    50    100
2021-04-10  04:00:00    50    50   100    100
2021-04-10  05:00:00   100   100   100    100
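
A note on the row-wise condition: for scalar values such as row['col2'], Python's `and` is the operator to use. `&` binds more tightly than `<=`, so `a <= 50 & b <= 50` is parsed as the chained comparison `a <= (50 & b) <= 50`. The sketch below uses plain integers chosen only for illustration (they are not from the sample data) to show a case where the two forms disagree:

```python
# Hypothetical values, picked so that the two expressions differ
col2, col3 = 0, 100

# Parsed as: col2 <= (50 & col3) <= 50, i.e. 0 <= 32 <= 50
with_ampersand = col2 <= 50 & col3 <= 50

# Each comparison is evaluated on its own, then combined
with_and = col2 <= 50 and col3 <= 50

print(with_ampersand, with_and)
```

With float values the `&` form does not even run, because `50 & 50.0` raises a TypeError. On whole columns (Series) `&` is the right operator, as long as each comparison is wrapped in parentheses.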

Answer 2

# Create a mask for the basic condition
mask1 = (df['col2'] <= 50) & (df['col3'] <= 50)

# Use loc to select the rows where the condition is met and put the col1 value in state
df.loc[mask1, 'state'] = df['col1']

# ~mask1 selects the rows where the condition is not met; put the row-wise maximum in state
df.loc[~mask1, 'state'] = df[['col1', 'col2', 'col3']].max(axis=1)
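
For reference, here is a minimal, self-contained sketch of the mask approach on the question's sample data, writing the row-wise maximum where the condition fails, as the question asks. The DataFrame construction and the `cols` name are assumptions added for illustration.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2021-04-10 01:00:00", "2021-04-10 02:00:00", "2021-04-10 03:00:00",
        "2021-04-10 04:00:00", "2021-04-10 05:00:00",
    ]),
    "col1": [25, 25, 25, 50, 100],
    "col2": [50, 50, 100, 50, 100],
    "col3": [50, 50, 50, 100, 100],
})

cols = ["col1", "col2", "col3"]

# True where both col2 and col3 are at most 50
mask = (df["col2"] <= 50) & (df["col3"] <= 50)

# Start from the row-wise maximum, then overwrite the rows where the mask holds
df["state"] = df[cols].max(axis=1)
df.loc[mask, "state"] = df.loc[mask, "col1"]

print(df)
```

Filling the whole column first and then overwriting the masked rows keeps the column's integer dtype, since no row is ever left empty.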

Answer 3

To improve on the other answers, I would use pandas apply to go over the rows and compute the new column.

def calc_new_col(row):
    # Use `and` on scalar values: `&` binds more tightly than `<=`
    if row['col2'] <= 50 and row['col3'] <= 50:
        return row['col1']
    else:
        return max(row['col1'], row['col2'], row['col3'])

df["state"] = df.apply(calc_new_col, axis=1)
# axis=1 makes sure that function is applied to each row

print(df)
            datetime  col1  col2  col3  state
2021-04-10  01:00:00    25    50    50     25
2021-04-10  02:00:00    25    50    50     25
2021-04-10  03:00:00    25   100    50    100
2021-04-10  04:00:00    50    50   100    100
2021-04-10  05:00:00   100   100   100    100

apply makes the code more concise and easier to reuse.
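
To illustrate the reuse point, here is a sketch in which the threshold becomes a parameter that apply forwards to the function as a keyword argument. The function name `state_for_row`, the `threshold` parameter and the `state_100` column are hypothetical, and the DataFrame is rebuilt from the question's sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [25, 25, 25, 50, 100],
    "col2": [50, 50, 100, 50, 100],
    "col3": [50, 50, 50, 100, 100],
})

def state_for_row(row, threshold=50):
    """Return col1 if col2 and col3 are both <= threshold, else the row maximum."""
    if row["col2"] <= threshold and row["col3"] <= threshold:
        return row["col1"]
    return max(row["col1"], row["col2"], row["col3"])

# Extra keyword arguments given to apply are passed on to the function
df["state"] = df.apply(state_for_row, axis=1, threshold=50)
df["state_100"] = df.apply(state_for_row, axis=1, threshold=100)

print(df)
```

Keep in mind that apply with axis=1 calls the function once per row in Python, so on large frames a vectorized mask or np.where is usually much faster.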

Answer 4

An option using np.where:

import numpy as np
import pandas as pd

df = pd.DataFrame({'datetime': {0: '2021-04-10 01:00:00', 1: '2021-04-10 02:00:00',
                                2: '2021-04-10 03:00:00', 3: '2021-04-10 04:00:00',
                                4: '2021-04-10 05:00:00'},
                   'col1': {0: 25.0, 1: 25.0, 2: 25.0, 3: 50.0, 4: 100.0},
                   'col2': {0: 50.0, 1: 50.0, 2: 100.0, 3: 50.0, 4: 100.0},
                   'col3': {0: 50, 1: 50, 2: 50, 3: 100, 4: 100}})

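# Take the max over the numeric columns only, so the string datetime column is not compared
df['state'] = np.where((df['col2'] <= 50) & (df['col3'] <= 50),
                       df['col1'],
                       df[['col1', 'col2', 'col3']].max(axis=1))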

print(df)

Output:

           datetime  col1  col2  col3  state
2021-04-10 01:00:00  25.0  50.0    50   25.0
2021-04-10 02:00:00  25.0  50.0    50   25.0
2021-04-10 03:00:00  25.0 100.0    50  100.0
2021-04-10 04:00:00  50.0  50.0   100  100.0
2021-04-10 05:00:00 100.0 100.0   100  100.0
