Pandas: what is the truth value condition when dealing with missing data?

I have a function that creates a ratio. It is defined as:

def create_ratio(data,num,den):
    if data[num].isnull():
        ratio = -9997
    if data[den].isnull():
        ratio = -9998
    if data[num].isnull() & data[den].isnull():
        ratio = -9999
    else:
        ratio = data[num]/data[den]
    return ratio

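I have a pandas DataFrame (df_credit) that includes the credit card balance (cc_bal) and limit (cc_limit), and I want to compute the credit card utilization, i.e. balance divided by limit: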

df_credit['cc_util'] = create_ratio(df_credit,'cc_bal','cc_limit')

I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-66-d53809a7690d> in <module>
----> 1 data['ratio_cc_util'] = create_ratio(data,'open_credit_card_credit_limit_nomiss','open_credit_card_credit_limit_nomiss')
      2 data['ratio_cc_util'].hist()

<ipython-input-65-99bc55b184ed> in create_ratio(data, num, den)
      1 def create_ratio(data,num,den):
----> 2     if data[num].isnull():
      3         ratio = -9997
      4     if data[den].isnull():
      5         ratio = -9998

/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py in __nonzero__(self)
   1441     def __nonzero__(self):
   1442         raise ValueError(
-> 1443             f"The truth value of a {type(self).__name__} is ambiguous. "
   1444             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   1445         )

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

How can I fix this error? Thanks.

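  • You are mixing scalars and Series. Given how it is called, your function needs to return a Series or an array with one value per row, and a Series cannot be used directly as an `if` condition.
  • The simplest way to implement this conditional logic is np.select().
  • The simulated data below includes missing values to match your use case.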
import numpy as np
import pandas as pd

# Simulated balances and limits
df = pd.DataFrame({
    "cc_bal": np.random.uniform(200, 1000, 200),
    "cc_limit": np.random.uniform(800, 1200, 200),
})

# Blank out up to 30 randomly chosen rows in each column
df.loc[np.unique(np.random.choice(range(len(df)), 30)), "cc_bal"] = None
df.loc[np.unique(np.random.choice(range(len(df)), 30)), "cc_limit"] = None


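def create_ratio(df, num, den):
    # np.select uses the first matching condition for each row,
    # so the "both missing" case must be listed first.
    return np.select(
        [
            df[num].isnull() & df[den].isnull(),
            df[num].isnull(),
            df[den].isnull(),
        ],
        [-9999, -9997, -9998],
        # Default when no condition matches: the ordinary ratio
        df[num] / df[den],
    )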


df["ratio"] = create_ratio(df, "cc_bal", "cc_limit")
df

Sample output

    cc_bal  cc_limit     ratio
0  372.633   981.996  0.379465
1  845.541   1133.69  0.745831
2  449.406   975.903  0.460503
3  209.827   922.829  0.227374
4  237.347   936.654  0.253398
5  351.154       nan     -9998
6      nan   873.671     -9997
7  803.396   861.791   0.93224
8  591.136   807.176  0.732352
9  675.397   847.059  0.797344
9 675.397 847.059 0.797344