Understanding this implementation of logistic regression
Following this example implementation of logistic regression with scikit-learn:
https://analyticsdataexploration.com/logistic-regression-using-python/
After running the prediction, it produces the following:
predictions=modelLogistic.predict(test[predictor_Vars])
predictions
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1,
0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0,
0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0,
1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0,
1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1,
0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1,
0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1,
1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0,
1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0,
1, 0, 0, 0], dtype=int64)
I can't make sense of the array values. I think they relate to the logistic function and are outputting what the model thinks the labels are, but shouldn't these values be between 0 and 1, rather than exactly 0 or 1?
Reading the documentation for the predict function:
predict(X)
Predict class labels for samples in X.
Parameters:
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Samples.
Returns:
C : array, shape = [n_samples]
Predicted class label per sample.
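To illustrate the documented contract, here is a minimal sketch on a made-up toy dataset (the feature values and labels below are invented purely for illustration, not from the Titanic data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Four samples, one feature each; binary class labels.
X_toy = np.array([[0.0], [0.2], [0.8], [1.0]])
y_toy = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X_toy, y_toy)

# predict returns an array of shape [n_samples]: one class label per sample,
# each drawn from the set of labels seen in y during fit.
labels = clf.predict(X_toy)
print(labels.shape)  # (4,)
```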
Taking the first 5 values of the returned array: 0, 1, 0, 0, 1. How do I interpret these as labels?
Full code:
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in newer scikit-learn
import matplotlib.pyplot as plt
%matplotlib inline

train = pd.read_csv('/train.csv')
test = pd.read_csv('/test.csv')

def data_cleaning(train):
    train["Age"] = train["Age"].fillna(train["Age"].median())
    train["Fare"] = train["Fare"].fillna(train["Fare"].median())  # was mistakenly filling Fare from the Age column
    train["Embarked"] = train["Embarked"].fillna("S")
    train.loc[train["Sex"] == "male", "Sex"] = 0
    train.loc[train["Sex"] == "female", "Sex"] = 1
    train.loc[train["Embarked"] == "S", "Embarked"] = 0
    train.loc[train["Embarked"] == "C", "Embarked"] = 1
    train.loc[train["Embarked"] == "Q", "Embarked"] = 2
    return train

train = data_cleaning(train)
test = data_cleaning(test)

predictor_Vars = ["Sex", "Age", "SibSp", "Parch", "Fare"]
X, y = train[predictor_Vars], train.Survived
X.iloc[:5]
y.iloc[:5]

modelLogistic = linear_model.LogisticRegression()
modelLogisticCV = cross_val_score(modelLogistic, X, y, cv=15)

modelLogistic = linear_model.LogisticRegression()
modelLogistic.fit(X, y)

# predict(X): Predict class labels for samples in X.
predictions = modelLogistic.predict(test[predictor_Vars])
Update:
Printing the first elements of the test dataset, you can see they match the first predictions in the array:
0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0
So these are the logistic regression predictions for the test dataset, after fitting the logistic regression model on the train dataset.
As stated in the documentation, the values returned by the predict function are class labels: the same kind of values you supplied as y to the fit function. In your case, 1 means survived and 0 means did not survive.
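For example, the integer labels can be mapped back to readable outcomes; the mapping below just reverses the Survived encoding from the question, and the five values are the first five entries of the predictions array:

```python
import numpy as np

# First five predictions from the array in the question.
predictions = np.array([0, 1, 0, 0, 1])

# 1 = survived, 0 = did not survive (the encoding of the Survived column).
outcomes = np.where(predictions == 1, "survived", "did not survive")
print(outcomes)
```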
If you want a score for each prediction, use predict_proba, which returns the probability of each class (values between 0 and 1), or decision_function, which returns the signed distance to the decision boundary (not bounded to a fixed range).
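Here is a minimal sketch of the difference, assuming a fitted LogisticRegression like the one in the question; the toy data below stand in for the Titanic features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [0.2], [0.8], [1.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

# Probability of each class per sample, shape [n_samples, n_classes];
# each row sums to 1.
proba = model.predict_proba(X)

# Signed distance to the decision boundary; can be any real number.
scores = model.decision_function(X)

# predict is the hard-threshold version of predict_proba:
# class 1 whenever P(class 1) > 0.5.
labels = model.predict(X)
```

The column order of predict_proba follows model.classes_, so for this 0/1 problem proba[:, 1] is the probability of class 1 (survived).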
I hope this answers your question.