Machine Learning Classification using categorical and text data as input
I have a dataset of roughly 400 rows, with several categorical columns plus one free-text description column as inputs to my classification model. I plan to use an SVM as the classifier. Since the model cannot accept non-numeric input, I have been converting the input features to numeric data.

I have already applied TF-IDF to the description column, which turns the terms into a matrix.

Do I need to convert the categorical features with label encoding, merge the result with the TF-IDF matrix, and then feed that into the machine learning model?
Use ColumnTransformer to apply a different transformation pipeline to each group of columns, depending on their data type, and let it concatenate the results for you — no manual merging needed. For nominal categorical features, one-hot encoding is usually preferable to label encoding, because label encoding imposes an artificial ordering that an SVM would treat as meaningful. Here is an example:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# pipeline for text data
# note: the text column is given as a single string, not a list,
# because TfidfVectorizer expects a 1-D array of raw documents
text_features = 'text_column'
text_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words='english'))
])

# pipeline for categorical data (given as a list of column names)
categorical_features = ['cat_col1', 'cat_col2']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# you can add further transformers for other data types here

# combine the preprocessing steps with ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('text', text_transformer, text_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# add the model as the final step of the pipeline
clf_pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', SVC())
])

## you can use the preprocessor by itself ...
# X_train = preprocessor.fit_transform(X_train)
# X_test = preprocessor.transform(X_test)
# clf = SVC().fit(X_train, y_train)
# clf.score(X_test, y_test)

## ... or, better, use the whole pipeline:
# clf_pipe.fit(X_train, y_train)
# clf_pipe.score(X_test, y_test)
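To illustrate end-to-end, here is a minimal runnable sketch on a tiny made-up DataFrame (the column names, data, and labels are invented for this example; your real columns will differ):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# toy dataset: two categorical columns, one text column, one label
df = pd.DataFrame({
    'cat_col1': ['a', 'b', 'a', 'b', 'a', 'b'],
    'cat_col2': ['x', 'x', 'y', 'y', 'x', 'y'],
    'text_column': [
        'cheap flight deals', 'meeting agenda attached',
        'discount flight offer', 'quarterly meeting notes',
        'flight sale today', 'agenda for the meeting',
    ],
    'label': [1, 0, 1, 0, 1, 0],
})

preprocessor = ColumnTransformer(transformers=[
    # string selector -> 1-D column of raw text, as TfidfVectorizer expects
    ('text', TfidfVectorizer(), 'text_column'),
    # list selector -> 2-D selection for the categorical pipeline
    ('cat', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), ['cat_col1', 'cat_col2']),
])

clf_pipe = Pipeline(steps=[('preprocessor', preprocessor), ('model', SVC())])

X = df[['cat_col1', 'cat_col2', 'text_column']]
y = df['label']
clf_pipe.fit(X, y)
print(clf_pipe.score(X, y))  # training accuracy on this toy data
```

ColumnTransformer horizontally stacks the sparse TF-IDF matrix with the one-hot columns, so the merge you asked about happens automatically inside the pipeline.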