使用矢量化(更好)或其他方法比较多个 pandas 列(第 1 和第 2,第 3 和第 4 之后,等等)

Compare multiple pandas columns (1st and 2nd, after 3rd and 4rth, after etc) with vectorization (better) or other method

此代码基于 conditionvar1var2 进行比较,并根据 choices 创建 Results1(此代码运行良好):

# from: 
# from: 

import pandas as pd
import numpy as np

# Creating one column from two columns. We asume that in every row there is one NaN and one value and that value fills new column.
df = pd.DataFrame({ 'var1': ['a', 'b', 'c',np.nan, np.nan],
                   'var2': [1, 2, np.nan , 4, np.nan], 
                   'var3': [np.nan , "x", np.nan, "y", "z"],
                   'var4': [np.nan , 4, np.nan, 5, 6],
                   'var5': ["a", np.nan , "b", np.nan, "c"],
                   'var6': [1, np.nan , 2, np.nan, 3]
                 })


#all conditions that are connected with logical operators (&, |, etc) should be in ().
conditions = [
    (df["var1"].notna()) & (df['var2'].notna()),
    (pd.isna(df["var1"])) & (pd.isna(df["var2"])),
    (df["var1"].notna()) & (pd.isna(df["var2"])),
    (pd.isna(df["var1"])) & (df['var2'].notna())]

choices = ["Both values", np.nan, df["var1"], df["var2"]]

df['Result1'] = np.select(conditions, choices, default=np.nan)

df 看起来应该是:

|    | var1   |   var2 | var3   |   var4 | var5   |   var6 | Result1     |
|---:|:-------|-------:|:-------|-------:|:-------|-------:|:------------|
|  0 | a      |      1 | nan    |    nan | a      |      1 | Both values |
|  1 | b      |      2 | x      |      4 | nan    |    nan | Both values |
|  2 | c      |    nan | nan    |    nan | b      |      2 | c           |
|  3 | nan    |      4 | y      |      5 | nan    |    nan | 4           |
|  4 | nan    |    nan | z      |      6 | c      |      3 | nan         |

现在我想比较多个 pandas 列(在我的示例中 var1var2,在 var3var4 之后,在 var5var6) 并基于 conditionchoices 创建相应的 Results 列(在我的示例中 Result1, Result2, Result3).我认为最好的方法应该是使用矢量化(因为性能更好)。我想要得到的 df 应该是这样的:

|    | var1   |   var2 | var3   |   var4 | var5   |   var6 | Result1     | Result2     | Result3     |
|---:|:-------|-------:|:-------|-------:|:-------|-------:|:------------|:------------|:------------|
|  0 | a      |      1 | nan    |    nan | a      |      1 | Both values | nan         | Both values |
|  1 | b      |      2 | x      |      4 | nan    |    nan | Both values | Both values | nan         |
|  2 | c      |    nan | nan    |    nan | b      |      2 | c           | nan         | Both values |
|  3 | nan    |      4 | y      |      5 | nan    |    nan | 4           | Both values | nan         |
|  4 | nan    |    nan | z      |      6 | c      |      3 | nan         | Both values | Both values |

我试过这个:

import pandas as pd
import numpy as np

# Creating one column from two columns. We asume that in every row there is one NaN and one value and that value fills new column.
df = pd.DataFrame({ 'var1': ['a', 'b', 'c',np.nan, np.nan],
                   'var2': [1, 2, np.nan , 4, np.nan], 
                   'var3': [np.nan , "x", np.nan, "y", "z"],
                   'var4': [np.nan , 4, np.nan, 5, 6],
                   'var5': ["a", np.nan , "b", np.nan, "c"],
                   'var6': [1, np.nan , 2, np.nan, 3]
                 })


col1 = ["var1", "var3", "var5"]
col2 = ["var2", "var4", "var6"]
colR = ["Result1", "Result2", "Result3"]

#all conditions that are connected with logical operators (&, |, etc) should be in ().
conditions = [
    (df[col1].notna()) & (df[col2].notna()),
    (pd.isna(df[col1])) & (pd.isna(df[col2])),
    (df[col1].notna()) & (pd.isna(df[col2])),
    (pd.isna(df[col1])) & (df[col2].notna())]

choices = ["Both values", np.nan, df[col1], df[col2]]

df[colR] = np.select(conditions, choices, default=np.nan)

买它给我错误:

ValueError: shape mismatch: objects cannot be broadcast to a single shape

问题:如何通过矢量化(因为性能更好而更可取)或其他方法实现我的目标?

问题是 pandas DataFrame 强制索引对齐,但 df[col1]df[col2] 没有重叠的列。

在这种情况下,您确实希望使用底层的 numpy 数组。也因为 .isnull()notnull 相反,你可以简化很多。我们将连接以添加新列。

col1 = ["var1", "var3", "var5"]
col2 = ["var2", "var4", "var6"]
colR = ["Result1", "Result2", "Result3"]

s1 = df[col1].isnull().to_numpy()
s2 = df[col2].isnull().to_numpy()

conditions = [~s1 & ~s2, s1 & s2, ~s1 & s2, s1 & ~s2]
choices = ["Both values", np.nan, df[col1], df[col2]]

df = pd.concat([df, pd.DataFrame(np.select(conditions, choices), columns=colR, index=df.index)], axis=1)

  var1  var2 var3  var4 var5  var6      Result1      Result2      Result3
0    a   1.0  NaN   NaN    a   1.0  Both values          NaN  Both values
1    b   2.0    x   4.0  NaN   NaN  Both values  Both values          NaN
2    c   NaN  NaN   NaN    b   2.0            c          NaN  Both values
3  NaN   4.0    y   5.0  NaN   NaN            4  Both values          NaN
4  NaN   NaN    z   6.0    c   3.0          NaN  Both values  Both values