Logistic Regression with just ONE numeric feature

Question

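What is the correct way to use scikit-learn's LogisticRegression solver when you have only one numeric feature?

I ran a simple example whose output I find hard to explain. Can anyone explain what I am doing wrong here?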

import numpy as np
from sklearn.linear_model import LogisticRegression

X = [1, 2, 3, 10, 11, 12]
X = np.reshape(X, (6, 1))
Y = [0, 0, 0, 1, 1, 1]
Y = np.reshape(Y, (6, 1))

lr = LogisticRegression()

lr.fit(X, Y)
# predict expects a 2D array of shape (n_samples, n_features)
print("2 --> {0}".format(lr.predict([[2]])))
print("4 --> {0}".format(lr.predict([[4]])))

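This is the output I get after the script finishes running. Shouldn't the prediction for 4 be 0, since under a Gaussian distribution 4 is closer to the distribution that the training data classifies as 0?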

2 --> [0]
4 --> [1]

What approach does Logistic Regression take when you have just one column of numeric data?

Answer 1

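You are handling the single feature correctly, but you are wrongly assuming that just because 4 is close to the class 0 features, it will also be predicted as class 0.

You can plot the training data together with the sigmoid function, assuming a threshold of y = 0.5 for classification, and using the coefficient and intercept learned by the regression model: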

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

X = [1, 2, 3, 10, 11, 12]
X = np.reshape(X, (6, 1))
Y = [0, 0, 0, 1, 1, 1]
Y = np.reshape(Y, (6, 1))

lr = LogisticRegression()
lr.fit(X, Y.ravel())  # fit expects a 1D target; ravel avoids a DataConversionWarning

plt.figure(1, figsize=(4, 3))
plt.scatter(X.ravel(), Y, color='black', zorder=20)

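def model(x):
    # Logistic (sigmoid) function
    return 1 / (1 + np.exp(-x))

X_test = np.linspace(-5, 15, 300)
# Despite its name, `loss` holds the predicted probability of class 1 at each X_test value
loss = model(X_test * lr.coef_ + lr.intercept_).ravel()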

plt.plot(X_test, loss, color='red', linewidth=3)
plt.axhline(y=0, color='k', linestyle='-')
plt.axhline(y=1, color='k', linestyle='-')
plt.axhline(y=0.5, color='b', linestyle='--')
# X_test[123] (about 3.227) is the grid point just above the 0.5 crossing in this answer's fit
plt.axvline(x=X_test[123], color='b', linestyle='--')

plt.ylabel('y')
plt.xlabel('X')
plt.xlim(0, 13)
plt.show()

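In your example, the sigmoid function looks like this:

Zoomed in a bit:

For your particular model, the value of X at which Y reaches the 0.5 classification threshold lies between 3.161 and 3.227. You can check this by comparing the loss and X_test arrays (X_test[123] is the X value associated with the upper bound; you could use a function optimization method to get the exact value if you wanted).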

So the reason 4 is predicted as class 1 is that 4 lies above the boundary where Y == 0.5.
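The boundary can also be computed in closed form rather than read off a grid. The sigmoid equals 0.5 exactly where its argument is zero, so coef * x + intercept = 0 gives x = -intercept / coef. Below is a minimal sketch, assuming the same six training points. The names w, b and boundary are introduced here for illustration, and the printed value depends on the scikit-learn version and its default solver, so no output is shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([1, 2, 3, 10, 11, 12]).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 1, 1])

lr = LogisticRegression()
lr.fit(X, y)

# sigmoid(w * x + b) == 0.5 exactly when w * x + b == 0
w = lr.coef_[0, 0]
b = lr.intercept_[0]
boundary = -b / w  # illustrative name, not a scikit-learn attribute

print("decision boundary at x = {0:.3f}".format(boundary))
# Points just below and just above the boundary fall into class 0 and class 1
print(lr.predict([[boundary - 0.1], [boundary + 0.1]]))
```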

You can demonstrate this further with:

# predict expects a 2D array of shape (n_samples, n_features)
print("2 --> {0}".format(lr.predict([[2]])))
print("3 --> {0}".format(lr.predict([[3]])))
print("3.1 --> {0}".format(lr.predict([[3.1]])))
print("3.3 --> {0}".format(lr.predict([[3.3]])))
print("4 --> {0}".format(lr.predict([[4]])))

This prints the following:

2 --> [0]
3 --> [0]
3.1 --> [0]  # Below threshold
3.3 --> [1]  # Above threshold
4 --> [1]
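To see how close to the threshold each of these predictions sits, predict_proba returns the class probabilities directly and accepts all query points in one 2D array. This is a sketch under the same assumptions as the training data above. The names queries, p_class1 and labels are illustrative, and the exact probabilities depend on the scikit-learn version, so no output is shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([1, 2, 3, 10, 11, 12]).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 1, 1])
lr = LogisticRegression().fit(X, y)

# One row per query point, one column per feature
queries = np.array([2, 3, 3.1, 3.3, 4]).reshape(-1, 1)

# Column 1 of predict_proba is the estimated probability of class 1
p_class1 = lr.predict_proba(queries)[:, 1]
labels = lr.predict(queries)

for x, p, label in zip(queries.ravel(), p_class1, labels):
    print("{0} --> class {1} (P(class 1) = {2:.3f})".format(x, label, p))
```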

Answer 2

I changed a few things in your code, and the expected result appeared:

import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([1, 2, 3, 10, 11, 12]).reshape(-1, 1)
y_train = np.array([0, 0, 0, 1, 1, 1])

logistic_regression = LogisticRegression()
logistic_regression.fit(X_train, y_train)
results = logistic_regression.predict(np.array([2,4,6.4,6.5]).reshape(-1,1))

print('2--> {}'.format(results[0]))
print('4--> {}'.format(results[1]))
print('6.4 --> {}'.format(results[2]))
print('6.5 --> {}'.format(results[3]))

The result is:

2--> 0
4--> 0
6.4 --> 0
6.5 --> 1

I think you got the wrong result because of the reshape of the Y array, which is not needed...
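One way to check whether the shape of Y is really the cause is to fit the same data with both target shapes and with two different solvers. The sketch below rests on an assumption that is not confirmed by either answer: that the differing outputs may instead come from the default solver, which changed from liblinear to lbfgs in scikit-learn 0.22. liblinear also regularizes the intercept, which can shift the boundary. The helper boundary and the dictionary results are hypothetical names introduced for illustration.

```python
import warnings
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([1, 2, 3, 10, 11, 12]).reshape(-1, 1)
y_flat = np.array([0, 0, 0, 1, 1, 1])
y_column = y_flat.reshape(-1, 1)  # the shape used in the question

def boundary(model):
    # x value where the predicted probability crosses 0.5 (hypothetical helper)
    return -model.intercept_[0] / model.coef_[0, 0]

results = {}
for solver in ("liblinear", "lbfgs"):
    for name, y in (("flat", y_flat), ("column", y_column)):
        with warnings.catch_warnings():
            # A column-vector y triggers a DataConversionWarning; silence it here
            warnings.simplefilter("ignore")
            model = LogisticRegression(solver=solver).fit(X, y)
        results[(solver, name)] = boundary(model)
        print("{0:9s} y={1:6s} boundary at x = {2:.3f}, predict(4) = {3}".format(
            solver, name, results[(solver, name)], model.predict([[4]])[0]))
```

If both target shapes give the same boundary for a given solver while the two solvers disagree, the solver default rather than the reshape would explain the different predictions for 4.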
