scikit-learn classifier with mixed-type features returns 0% accuracy with test data
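I am new to machine learning and Python. I want to use the DecisionTreeClassifier from sklearn. Since my features are partly numerical and partly categorical, I need to transform them, because DecisionTreeClassifier only accepts numerical features as input.
To do this, I use a ColumnTransformer and pipelines. The idea is as follows:
- the categorical and numerical features are transformed in separate pipelines
- both are combined to form the input for the classifier
However, the accuracy on my test data is always 0%, whereas the accuracy on the training data is around 85%.
Furthermore, calling cross_val_score() returns
ValueError: Found unknown categories ['Holand-Netherlands'] in column 7 during transform
This is strange, because I used this data to train full_pipeline. Using different classifiers results in the same behaviour, which makes me believe that something is wrong with the transformation. Any help is much appreciated!
Here is my code: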
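import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# column names for the adult dataset (the files have no header row)
names = ["age",
         "workclass",
         "final-weight",
         "education",
         "education-num",
         "martial-status",
         "occupation",
         "relationship",
         "race",
         "sex",
         "capital-gain",
         "capial-loss",
         "hours-per-week",
         "native-country",
         "agrossincome"]
categorical_features = ["workclass", "education", "martial-status", "occupation", "relationship", "race", "sex", "native-country"]
numerical_features = ["age", "final-weight", "education-num", "capital-gain", "capial-loss", "hours-per-week"]
features = np.concatenate([categorical_features, numerical_features])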
# create pandas dataframes for the adult dataset
adult_train = pd.read_csv(
    filepath_or_buffer="https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
    delimiter=",",
    index_col=False,
    skipinitialspace=True,
    header=None,
    names=names,
)
adult_test = pd.read_csv(
    filepath_or_buffer="https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
    delimiter=",",
    index_col=False,
    skipinitialspace=True,
    header=None,
    names=names,
)
# the first line of adult.test is not a data record, so drop it
adult_test.drop(0, inplace=True)
adult_test.reset_index(inplace=True)
# missing values are encoded as "?" in the files; mark them as NaN
adult_train.replace(to_replace="?", value=np.nan, inplace=True)
adult_test.replace(to_replace="?", value=np.nan, inplace=True)
# split data into features and targets
x_train = adult_train[features]
y_train = adult_train.agrossincome
x_test = adult_test[features]
y_test = adult_test.agrossincome
# create pipeline for preprocessing + classifier
categorical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("encoding", OrdinalEncoder()),
])
numerical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler(with_mean=False)),
])
preprocessing = ColumnTransformer(transformers=[
    ("categorical_pipeline", categorical_pipeline, categorical_features),
    ("numerical_pipeline", numerical_pipeline, numerical_features),
])
full_pipeline = Pipeline(steps=[
    ("preprocessing", preprocessing),
    ("model", DecisionTreeClassifier(random_state=0, max_depth=5)),
])
full_pipeline.fit(x_train, y_train)
print(full_pipeline.score(x_test, y_test))
#print(cross_val_score(full_pipeline, x_train, y_train, cv=3).mean())
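The error comes from y_test: its labels end with a "." (for example ">50K."), whereas the labels in y_train do not (">50K"). The predictions therefore never match the test labels exactly. Removing the "." at the end should fix it.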
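To illustrate why a trailing "." alone can drive the score to 0%, here is a minimal sketch on made-up labels. The lists and the helper names `accuracy` and `strip_trailing_period` are hypothetical and not part of the original code; the sketch only shows that exact string comparison fails until the labels are normalised.

```python
# Hypothetical miniature of the situation; these values are made up for
# illustration and are not read from the real adult.data / adult.test files.
train_style_predictions = ["<=50K", ">50K", "<=50K", "<=50K"]  # labels a model fitted on y_train can output
test_labels_raw = ["<=50K.", ">50K.", "<=50K.", ">50K."]       # test labels carrying a trailing "."


def accuracy(y_true, y_pred):
    """Fraction of positions where the two label sequences match exactly."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


def strip_trailing_period(labels):
    """Return a new list of labels with any trailing '.' removed."""
    return [label.rstrip(".") for label in labels]


# Before the fix no label can match, because "<=50K." != "<=50K".
accuracy_before = accuracy(test_labels_raw, train_style_predictions)

# After stripping the ".", the labels are comparable again.
test_labels_clean = strip_trailing_period(test_labels_raw)
accuracy_after = accuracy(test_labels_clean, train_style_predictions)

print(accuracy_before)  # 0.0
print(accuracy_after)   # 0.75
```

In the code from the question, the analogous step would presumably be `y_test = adult_test.agrossincome.str.rstrip(".")` before calling `full_pipeline.score(x_test, y_test)`. This only addresses the 0% accuracy. The ValueError from cross_val_score() looks like a separate matter: a category that is very rare in the training data can be absent from the part of a fold the encoder is fitted on. If that is the cause, passing `handle_unknown="use_encoded_value"` together with an `unknown_value` to OrdinalEncoder (available in recent scikit-learn versions) is one way to handle it.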