Python Logistic Regression
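I've been at this for a few hours now and feel really stuck.
I'm trying to use a number of columns in a CSV file, "ScoreBuckets.csv", to predict another column in the same file named "Score_Bucket". The problem is that my results make no sense at all, and I don't know how to use multiple columns to predict Score_Bucket. I'm new to data mining, so I'm not 100% familiar with the code and syntax.
Here is my code so far: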
import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold, cross_val_score
dataset = pd.read_csv('ScoreBuckets.csv')
CV = (dataset.Score_Bucket.reshape((len(dataset.Score_Bucket), 1))).ravel()
data = (dataset.ix[:,'CourseLoad_RelativeStudy':'Sleep_Sex'].values).reshape(
    (len(dataset.Score_Bucket), 2))
# Create a logistic regression object
LogReg = LogisticRegression()
# Train the model using the training sets
LogReg.fit(data, CV)
# the model
print('Coefficients (m): \n', LogReg.coef_)
print('Intercept (b): \n', LogReg.intercept_)
#predict the class for each data point
predicted = LogReg.predict(data)
print("Predictions: \n", np.array([predicted]).T)
# predict the probability/likelihood of the prediction
print("Probability of prediction: \n",LogReg.predict_proba(data))
modelAccuracy = LogReg.score(data,CV)
print("Accuracy score for the model: \n", LogReg.score(data,CV))
print(metrics.confusion_matrix(CV, predicted, labels=["Yes","No"]))
# Calculating 5 fold cross validation results
LogReg = LogisticRegression()
kf = KFold(len(CV), n_folds=5)
scores = cross_val_score(LogReg, data, CV, cv=kf)
print("Accuracy of every fold in 5 fold cross validation: ", abs(scores))
print("Mean of the 5 fold cross-validation: %0.2f" % abs(scores.mean()))
print("The accuracy difference between model and KFold is: ",
      abs(abs(scores.mean())-modelAccuracy))
ScoreBuckets.csv:
Score_Bucket,Healthy,Course_Load,Miss_Class,Relative_Study,Faculty,Sleep,Relation_Status,Sex,Relative_Stress,Res_Gym?,Tuition_Awareness,Satisfaction,Healthy_TuitionAwareness,Healthy_TuitionAwareness_MissClass,Healthy_MissClass_Sex,Sleep_Faculty_RelativeStress,TuitionAwareness_ResGym,CourseLoad_RelativeStudy,Sleep_Sex
5,0.5,1,0,1,0.4,0.33,1,0,0.5,1,0,0,0.75,0.5,0.17,0.41,0.5,1,0.17
2,1,1,0.33,0.5,0.4,0.33,0,0,1,0,0,0,0.5,0.44,0.44,0.58,0,0.75,0.17
5,0.5,1,0,0.5,0.4,0.33,1,0,0.5,0,1,0,0.75,0.5,0.17,0.41,0.5,0.75,0.17
4,0.5,1,0,0,0.4,0.33,0,0,0.5,0,1,0,0.25,0.17,0.17,0.41,0.5,0.5,0.17
5,0.5,1,0.33,0.5,0.4,0,1,1,1,0,1,0,0.75,0.61,0.61,0.47,0.5,0.75,0.5
5,0.5,1,0,1,0.4,0.33,1,1,1,1,1,1,0.75,0.5,0.5,0.58,1,1,0.67
5,0.5,1,0,0,0.4,0.33,0,0,0.5,0,1,0,0.25,0.17,0.17,0.41,0.5,0.5,0.17
2,0.5,1,0.67,0.5,0.4,0,1,1,0.5,0,0,0,0.75,0.72,0.72,0.3,0,0.75,0.5
5,0.5,1,0,1,0.4,0.33,0,1,1,0,1,1,0.25,0.17,0.5,0.58,0.5,1,0.67
5,1,1,0,0.5,0.4,0.33,0,1,0.5,0,1,1,0.5,0.33,0.67,0.41,0.5,0.75,0.67
0,0.5,1,0,1,0.4,0.33,0,0,0.5,0,0,0,0.25,0.17,0.17,0.41,0,1,0.17
2,0.5,1,0,0.5,0.4,0.33,1,1,1,0,0,0,0.75,0.5,0.5,0.58,0,0.75,0.67
5,0.5,1,0,1,0.4,0.33,0,0,1,1,1,0,0.25,0.17,0.17,0.58,1,1,0.17
0,0.5,1,0.33,0.5,0.4,0.33,1,1,0.5,0,1,0,0.75,0.61,0.61,0.41,0.5,0.75,0.67
5,0.5,1,0,0.5,0.4,0.33,0,0,0.5,0,1,1,0.25,0.17,0.17,0.41,0.5,0.75,0.17
4,0,1,0.67,0.5,0.4,0.67,1,0,0.5,1,0,0,0.5,0.56,0.22,0.52,0.5,0.75,0.34
2,0.5,1,0.33,1,0.4,0.33,0,0,0.5,0,1,0,0.25,0.28,0.28,0.41,0.5,1,0.17
5,0.5,1,0.33,0.5,0.4,0.33,0,1,1,0,1,0,0.25,0.28,0.61,0.58,0.5,0.75,0.67
5,0.5,1,0,1,0.4,0.33,0,0,0.5,1,1,0,0.25,0.17,0.17,0.41,1,1,0.17
5,0.5,1,0.33,0.5,0.4,0.33,1,1,1,0,1,0,0.75,0.61,0.61,0.58,0.5,0.75,0.67
Output:
Coefficients (m):
[[-0.4012899 -0.51699939]
[-0.72785212 -0.55622303]
[-0.62116232 0.30564259]
[ 0.04222459 -0.01672418]]
Intercept (b):
[-1.80383738 -1.5156701 -1.29452772 0.67672118]
Predictions:
[[5]
[5]
[5]
[5]
...
[5]
[5]
[5]
[5]]
Probability of prediction:
[[ 0.09302973 0.08929139 0.13621146 0.68146742]
[ 0.09777325 0.10103782 0.14934111 0.65184782]
[ 0.09777325 0.10103782 0.14934111 0.65184782]
[ 0.10232068 0.11359509 0.16267645 0.62140778]
...
[ 0.07920945 0.08045552 0.17396476 0.66637027]
[ 0.07920945 0.08045552 0.17396476 0.66637027]
[ 0.07920945 0.08045552 0.17396476 0.66637027]
[ 0.07346886 0.07417316 0.18264008 0.66971789]]
Accuracy score for the model:
0.671171171171
[[0 0]
[0 0]]
Accuracy of every fold in 5 fold cross validation:
[ 0.64444444 0.73333333 0.68181818 0.63636364 0.65909091]
Mean of the 5 fold cross-validation: 0.67
The accuracy difference between model and KFold is: 0.00016107016107
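I say the output makes no sense for two reasons:
1. No matter which columns I feed in, the prediction accuracy stays the same, which shouldn't happen, since some columns are better predictors of Score_Bucket than others.
2. It won't let me use multiple columns to predict Score_Bucket because it says they must be the same size, but how can that be when multiple columns are obviously larger than the single Score_Bucket column?
What am I doing wrong in my prediction?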
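First, double-check whether your problem can really be framed as a classification problem, or whether it should be formulated as a regression problem.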
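One way to start that check is to look at which values the target actually takes. The following is only a sketch: it uses the 20 Score_Bucket values from the rows posted in the question, and the variable names are made up for illustration.

```python
from collections import Counter

# Score_Bucket values copied from the 20 rows of ScoreBuckets.csv posted above.
labels = [5, 2, 5, 4, 5, 5, 5, 2, 5, 5, 0, 2, 5, 0, 5, 4, 2, 5, 5, 5]

counts = Counter(labels)
classes = sorted(counts)  # distinct target values, in ascending order
majority_class, majority_count = counts.most_common(1)[0]
majority_share = majority_count / len(labels)

print("Distinct values:", classes)
print("Counts per value:", dict(sorted(counts.items())))
print("Share of the most frequent value:", majority_share)
```

A handful of ordered integer values like these can be treated either as classes or as a numeric score, so the choice depends on what the buckets mean. The counts also show that a model which always predicts 5 would already be right on 12 of these 20 rows (0.6), so an accuracy close to the majority share says little about how useful the features are.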
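Assuming you really do want to classify your data into the four unique classes present in the Score_Bucket column, why do you think you cannot use multiple columns as predictors? In fact, you are already using the last two columns in your example. Since sklearn methods work directly with pandas DataFrames (no conversion to NumPy arrays is needed), you can make your code more readable: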
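# Select the two feature columns and the target straight from the DataFrame
X = dataset[["CourseLoad_RelativeStudy", "Sleep_Sex"]]
y = dataset["Score_Bucket"]  # 1-D Series; a one-column DataFrame makes sklearn warn about the shape
logreg = LogisticRegression()
logreg.fit(X, y)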
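For completeness, here is a self-contained sketch of the same idea that can be run without the CSV file. It assumes a reasonably recent pandas and scikit-learn, and it uses only the first eight posted rows, reduced to the target and the two feature columns, so the fitted numbers are not meaningful. The confusion-matrix call is an addition for illustration and takes its labels from the fitted model.

```python
import io

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# First eight rows of the posted CSV, reduced to the target and two features.
csv_text = """Score_Bucket,CourseLoad_RelativeStudy,Sleep_Sex
5,1,0.17
2,0.75,0.17
5,0.75,0.17
4,0.5,0.17
5,0.75,0.5
5,1,0.67
5,0.5,0.17
2,0.75,0.5
"""
dataset = pd.read_csv(io.StringIO(csv_text))

X = dataset[["CourseLoad_RelativeStudy", "Sleep_Sex"]]  # 2-D: (n_samples, n_features)
y = dataset["Score_Bucket"]                             # 1-D: (n_samples,)

logreg = LogisticRegression()
logreg.fit(X, y)
predicted = logreg.predict(X)

# Use the classes the model actually saw as the confusion-matrix labels.
cm = confusion_matrix(y, predicted, labels=logreg.classes_)
print("Classes:", logreg.classes_)
print(cm)
```

In the question's code, labels=["Yes","No"] names values that never occur in Score_Bucket, which would explain the all-zero matrix in the posted output.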
If you want to select more columns, you can use the loc indexer:
X = dataset.loc[:, "Healthy":"Sleep_Sex"]
You can also select columns by position:
X = dataset.iloc[:, 1:]
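A small sketch of how the two selections compare, assuming pandas is available. The tiny table below keeps only three of the posted rows and five of the columns, purely to keep the example short.

```python
import io

import pandas as pd

# Three posted rows, reduced to the target and four feature columns.
csv_text = """Score_Bucket,Healthy,Course_Load,Miss_Class,Sleep_Sex
5,0.5,1,0,0.17
2,1,1,0.33,0.17
4,0.5,1,0,0.17
"""
dataset = pd.read_csv(io.StringIO(csv_text))

X_by_label = dataset.loc[:, "Healthy":"Sleep_Sex"]  # label slice, both ends inclusive
X_by_position = dataset.iloc[:, 1:]                 # position slice, skips column 0
y = dataset["Score_Bucket"]

print(list(X_by_label.columns))
print("X shape:", X_by_label.shape, "y shape:", y.shape)
```

Both selections return the same feature table here. Note that X and y only need the same number of rows: X has one column per feature, while y stays one-dimensional, so using more feature columns does not create a size mismatch.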
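Regarding your second question, I do get different results from the cross-validation procedure depending on which columns I use as features. Note that you have a very small number of samples (20), which makes your estimated predictions quite variable.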