Pre-processing, resampling and pipelines - and an error in between

Question

I have a dataset with different types of variables: binary, categorical, numerical and textual.

   Text                                                Age   Type          Link             Start    Passed  Default
0  care packag saint luke cathol church wa ...         21.0  organisation  saintlukemclean  <2001.0  0       0
1  opportun busi group center food support compan...   23.0  organisation  cfanj            <2003.0  0       0
2  holiday ice rink persh squar depart cultur sit...   98.0  home          culturela        >1975.0  0       0

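I use a different transformer for each type: one for the categorical variables (OneHotEncoder), one for the numerical variables (SimpleImputer) and one for the text variable (CountVectorizer/TF-IDF):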

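from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Categorical variables: one-hot encode, ignoring categories unseen during fit
categorical_preprocessing = OneHotEncoder(handle_unknown='ignore')
# categorical_encoder =  ('CV',CountVectorizer())

# Numerical variables: fill missing values with the column mean
numeric_preprocessing = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
])

# CountVectorizer
text_preprocessing_cv = Pipeline(steps=[
    ('CV', CountVectorizer())
])

# TF-IDF
text_preprocessing_tfidf = Pipeline(steps=[
    ('TF-IDF', TfidfVectorizer())
])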

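I then transform my features and pass them into a pipeline (with the classifiers Logistic Regression, Multinomial Naive Bayes, Random Forest and Support Vector Machine) as follows: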

from sklearn.compose import ColumnTransformer

preprocessing = ColumnTransformer(
    transformers=[
        ('text', text_preprocessing_cv, text_columns),
        ('category', categorical_preprocessing, categorical_columns),
        ('numeric', numeric_preprocessing, numerical_columns)
])

However, I get an error at this step:

from sklearn.linear_model import LogisticRegression

clf = Pipeline(steps=[('preprocessor', preprocessing),
                      ('classifier', LogisticRegression())])

clf.fit(X_train, y_train) # <-- error

ValueError: Selected columns, ['Age','Default'] are not unique in dataframe.
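
A note for readers who hit the same message: as far as I know, ColumnTransformer raises it when a selected label matches more than one column of the DataFrame, so checking the column labels is a cheap first step. The following is only a minimal sketch on hypothetical toy data (the frames `left`, `right` and `X_dup` are made up for illustration and are not the data above):

```python
import pandas as pd

# Hypothetical toy frame in which 'Age' ends up twice, e.g. after a concat
left = pd.DataFrame({"Age": [21.0, 23.0], "Default": [0, 0]})
right = pd.DataFrame({"Age": [21.0, 23.0]})
X_dup = pd.concat([left, right], axis=1)

# Labels that occur more than once in the frame
duplicated_labels = X_dup.columns[X_dup.columns.duplicated()].tolist()
print(duplicated_labels)

# Keep only the first occurrence of every label
X_unique = X_dup.loc[:, ~X_dup.columns.duplicated()]
print(X_unique.columns.tolist())
```

If the list of duplicated labels is empty for the real training frame, the cause is probably elsewhere.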

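The error might be caused by my oversampling or by the way I preprocess the features. As far as I understand, the right order is to apply resampling only to the training set, to avoid overfitting, but it is not clear to me whether the different types of variables and transformers need to be handled before or after resampling.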

I would appreciate it if you could help me fix the error so that the pipeline works with this preprocessing. Thanks.

Here is the code for reference:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.utils import resample

text_columns = ['Text']
categorical_columns = ['Type', 'Link', 'Start']
numerical_columns = ['Age', 'Default']  # can I consider the boolean as numerical?

X = df[categorical_columns + numerical_columns + text_columns]
y = df['Passed']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Returning to one dataframe
training_set = pd.concat([X_train, y_train], axis=1)  # needed for the re-sampling technique

passed = training_set[training_set['Passed'] == 1]
not_passed = training_set[training_set['Passed'] == 0]

# Oversampling the minority
oversample = resample(passed,
                      replace=True,
                      n_samples=len(not_passed))

# Returning to new training set
oversample_train = pd.concat([not_passed, oversample])

train_df = oversample_train.copy()  # this train set is after applying the re-sampling
test_df = pd.concat([X_test, y_test], axis=1)

X_train = train_df.loc[:, train_df.columns != 'Passed']
y_train = train_df['Passed']

categorical_encoder = OneHotEncoder(handle_unknown='ignore')
numerical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
])
text_transformer_cv = Pipeline(steps=[
    ('cntvec', CountVectorizer())
])

# TF-IDF
text_preprocessing_tfidf = Pipeline(steps=[
    ('TF-IDF', TfidfVectorizer())
])

preprocessing = ColumnTransformer(
    transformers=[
        ('category', categorical_encoder, categorical_columns),
        ('numeric', numerical_pipe, numerical_columns),  # I think this is causing the error. But I do not know why not also categorical columns
        ('text', text_transformer_cv, text_columns)
])

clf = Pipeline(steps=[('preprocessor', preprocessing),
                      ('classifier', LogisticRegression())])

clf.fit(X_train, y_train)
   

Answer 1

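The problem is in how the single text column is passed. CountVectorizer expects a 1-D iterable of strings, and ColumnTransformer hands the transformer a 2-D selection when the column is given as a list but a 1-D one when it is given as a scalar string. I hope a future version of scikit-learn will allow ['Text',], but until then pass the name directly: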

...

text_columns = 'Text' # instead of ['Text']

preprocessing = ColumnTransformer(
    transformers=[
        ('text', text_preprocessing_cv, text_columns),
        ('category', categorical_preprocessing,
            categorical_columns), 
        ('numeric', numeric_preprocessing, numerical_columns)
    ],
    remainder='passthrough'
)
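
To see why the selector type matters, here is a minimal sketch on hypothetical toy data (the frame `toy` and its values are made up, loosely following the question's columns). It compares what CountVectorizer does with a 1-D Series and with a one-column DataFrame, and then runs a ColumnTransformer with the scalar selector:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, not the asker's dataset
toy = pd.DataFrame({
    "Text": ["care packag church", "busi group food support", "holiday ice rink"],
    "Type": ["organisation", "organisation", "home"],
})

# 1-D input: every row is treated as one document
from_series = CountVectorizer().fit_transform(toy["Text"])

# 2-D input: iterating a DataFrame yields its column labels,
# so the only "document" seen is the label 'Text'
from_frame = CountVectorizer().fit_transform(toy[["Text"]])

print(from_series.shape, from_frame.shape)

# Scalar selector for the text column, list selector for the categorical one
preprocessing = ColumnTransformer(transformers=[
    ("text", CountVectorizer(), "Text"),
    ("category", OneHotEncoder(handle_unknown="ignore"), ["Type"]),
])
features = preprocessing.fit_transform(toy)
print(features.shape)
```

With the list selector the text block no longer has one row per sample, so it cannot be stacked next to the other transformers' outputs.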

Tags: python, encoding, pipeline, machine-learning, scikit-learn