Convert nan to zero when numpy dtype is "object" within a custom transformer

Question

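I am reading a csv file with pandas read_csv; some of its numeric columns contain blanks. I pass the resulting dataframe to a column transformer that calls a custom transformer. The dataframe is converted to a numpy array and handed to that custom transformer.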

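It is inside this transformer that I am trying to replace the nan values, and I have not been able to.

I dug through SO, and none of the solutions I found have worked so far.

Here is my custom transformer: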

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class Xfrmer_replacenum(BaseEstimator, TransformerMixin):
    """
        this transformer does the global replace within the dataframe
        replace 365243 (specific to this case study) with 0
        replace +/-inf, nan with zero
    """
    # constructor
    def __init__(self):
        #we are not going to use this 
        self._features = None
        
    #Return self 
    def fit(self, X,y=None  ):
        return self
    
    def transform(self,X,y=None):     
        print(X)
        print(X.dtype)
        X = X.astype(float)
        #replace high values with zero
        #for col in X.columns:
        X[X==365243.0] = 0
        X[X==365243] = 0
        #np.where(X_==X_,X_,0)
        #np.nan_to_num(X[0, :].astype(np.float64))
        #X = np.nan_to_num(X.astype(np.float64))
        #X = X.astype(str).replace('nan', 0).astype(float)
        #np.frompyfunc(lambda x: x.replace(',',''),1,1)(X).astype(float)
        np.array([v.replace(',', '') for v in X], dtype=np.float32)
        print('replaced values')
        #X=X.replace([np.inf,-np.inf],np.nan)
        #X=X.replace(np.nan,0)    
        print('all replace',X.shape)
        print('just before ret',X[X==365243])
        np.savetxt("./output/prvapln_colxfrmr_onlynum.csv", X,fmt='%s',delimiter=",")
        return X

This is how I use the custom transformer in the column transformer:

    lst_cols = ["ID1", "ID2", "AMT_CREDIT_SUM", "AMT_CREDIT_SUM_DEBT", "AMT_CREDIT_SUM_LIMIT", "AMT_CREDIT_SUM_OVERDUE"]

    lst_idx = []

    lst_all_cols = X_train.columns.values.tolist()

    for col in lst_cols:
        idx = lst_all_cols.index(col)
        lst_idx.append(idx)

    preprcs_stg1_pipln = ColumnTransformer(
        transformers=[('repl_pipln', Xfrmer_replacenum(), lst_idx)],
        remainder='passthrough')

These are the various things I tried inside the custom transformer:

        np.where(X_==X_,X_,0)
        
        X = np.nan_to_num(X.astype(np.float64))

        X = X.astype(str).replace('nan', 0).astype(float)

        np.frompyfunc(lambda x: x.replace(',',''),1,1)(X).astype(float)

        np.array([v.replace(',', '') for v in X], dtype=np.float32)

My data looks like this (the array printed by the transformer, followed by its dtype):

[[2163253 154602 4187.34 -1230.0 -1226.0 0.0]
 [1676258 433469 22242.825 -1343.0 -1334.0 1.0]
 [2075578 418383 7656.705 -2341.0 -2332.0 0.0]
 [1548737 391536 21416.85 nan nan nan]
 [2721491 292308 3959.1 -2604.0 -2601.0 1.0]
 [2595549 432416 3951.225 -540.0 -537.0 0.0]]
object

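For the attempts that call replace, I get this error:

AttributeError: 'numpy.ndarray' object has no attribute 'replace'

My question

How do I replace the nan values in an ndarray of dtype object like the one above?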

Answer 1

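What you are looking for is to find all the np.nan entries and set them to the value you want, using the following code. np.isnan raises a TypeError on an object-dtype array, so cast to float first:

a = a.astype(float) # np.isnan does not accept object dtype
a[np.isnan(a)] = 1 # or whatever value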
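
Applied to the object-dtype array from the question, that idea might look like the sketch below. It is only an illustration under a few assumptions: the helper name `clean_numeric` and its `sentinel` parameter are made up for this example, every cell is assumed to hold a number or nan (a cell containing a string such as ' ' would make the float cast fail), and the `nan`, `posinf` and `neginf` keywords of `np.nan_to_num` require NumPy 1.17 or later.

```python
import numpy as np

def clean_numeric(X, sentinel=365243.0):
    """Return a float64 copy of X with the sentinel, NaN and +/-inf set to 0.

    Hypothetical helper for illustration; not part of numpy or scikit-learn.
    """
    # np.isnan and np.nan_to_num need a float array, so cast the
    # object-dtype input first (astype returns a new array).
    X = np.asarray(X).astype(np.float64)
    # Replace the case-specific placeholder value with zero.
    X[X == sentinel] = 0.0
    # Replace NaN and +/-inf with zero in a single call.
    return np.nan_to_num(X, nan=0.0, posinf=0.0, neginf=0.0)

# A small object-dtype array shaped like the data in the question.
sample = np.array(
    [[2163253, 154602, 4187.34, -1230.0, -1226.0, 0.0],
     [1548737, 391536, 21416.85, np.nan, np.nan, np.nan]],
    dtype=object,
)
cleaned = clean_numeric(sample)
print(cleaned.dtype)
print(cleaned[1])
```

Inside the custom transformer, the body of `transform` could presumably be reduced to `return clean_numeric(X)`. The AttributeError in the question comes from calling `.replace` on each row of the array: iterating over a 2-D ndarray yields 1-D ndarrays, which have no `replace` method.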

python

pipeline

numpy

pandas