How to get the correct answer for logistic regression?
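I am not getting the desired output on a binary classification problem.
The task is to use binary classification to label breast cancer as:
- benign, or
- malignant
But it does not give the desired output.
First, there is a function that loads the dataset and returns the training and test data with these shapes: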
x_train is of shape: (30, 381),
y_train is of shape: (1, 381),
x_test is of shape: (30, 188),
y_test is of shape: (1, 188).
Then there is a logistic regression classifier class that predicts the output.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np


def load_dataset():
    cancer_data = load_breast_cancer()
    x_train, x_test, y_train, y_test = train_test_split(cancer_data.data, cancer_data.target, test_size=0.33)
    x_train = x_train.T
    x_test = x_test.T
    y_train = y_train.reshape(1, (len(y_train)))
    y_test = y_test.reshape(1, (len(y_test)))
    m = x_train.shape[1]
    return x_train, x_test, y_train, y_test, m


class Neural_Network():
    def __init__(self):
        np.random.seed(1)
        self.weights = np.random.rand(30, 1) * 0.01
        self.bias = np.zeros(shape=(1, 1))

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def train(self, x_train, y_train, iterations, m, learning_rate=0.5):
        for i in range(iterations):
            z = np.dot(self.weights.T, x_train) + self.bias
            a = self.sigmoid(z)
            cost = (-1 / m) * np.sum(y_train * np.log(a) + (1 - y_train) * np.log(1 - a))
            if (i % 500 == 0):
                print("Cost after iteration %i: %f" % (i, cost))
            dw = (1 / m) * np.dot(x_train, (a - y_train).T)
            db = (1 / m) * np.sum(a - y_train)
            self.weights = self.weights - learning_rate * dw
            self.bias = self.bias - learning_rate * db

    def predict(self, inputs):
        m = inputs.shape[1]
        y_predicted = np.zeros((1, m))
        z = np.dot(self.weights.T, inputs) + self.bias
        a = self.sigmoid(z)
        for i in range(a.shape[1]):
            y_predicted[0, i] = 1 if a[0, i] > 0.5 else 0
        return y_predicted


if __name__ == "__main__":
    '''
    step-1 : Loading data set
    x_train is of shape: (30, 381)
    y_train is of shape: (1, 381)
    x_test is of shape: (30, 188)
    y_test is of shape: (1, 188)
    '''
    x_train, x_test, y_train, y_test, m = load_dataset()
    neuralNet = Neural_Network()
    '''
    step-2 : Train the network
    '''
    neuralNet.train(x_train, y_train, 10000, m)
    y_predicted = neuralNet.predict(x_test)
    print("Accuracy on test data: ")
    print(accuracy_score(y_test, y_predicted) * 100)
The program gives this output:
C:\Python36\python.exe C:/Users/LENOVO/PycharmProjects/MarkDmo001/Numpy.py
Cost after iteration 0: 5.263853
C:/Users/LENOVO/PycharmProjects/MarkDmo001/logisticReg.py:25: RuntimeWarning: overflow encountered in exp
return 1 / (1 + np.exp(-x))
C:/Users/LENOVO/PycharmProjects/MarkDmo001/logisticReg.py:33: RuntimeWarning: divide by zero encountered in log
cost = (-1 / m) * np.sum(y_train * np.log(a) + (1 - y_train) * np.log(1 - a))
C:/Users/LENOVO/PycharmProjects/MarkDmo001/logisticReg.py:33: RuntimeWarning: invalid value encountered in multiply
cost = (-1 / m) * np.sum(y_train * np.log(a) + (1 - y_train) * np.log(1 - a))
Cost after iteration 500: nan
Cost after iteration 1000: nan
Cost after iteration 1500: nan
Cost after iteration 2000: nan
Cost after iteration 2500: nan
Cost after iteration 3000: nan
Cost after iteration 3500: nan
Cost after iteration 4000: nan
Cost after iteration 4500: nan
Cost after iteration 5000: nan
Cost after iteration 5500: nan
Cost after iteration 6000: nan
Cost after iteration 6500: nan
Cost after iteration 7000: nan
Cost after iteration 7500: nan
Cost after iteration 8000: nan
Cost after iteration 8500: nan
Cost after iteration 9000: nan
Cost after iteration 9500: nan
Accuracy:
0.0
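The problem is exploding gradients. You need to normalize your inputs to [0, 1].
If you look at feature 3 and feature 23 in your training data, you will see values larger than 3000. After multiplying these values by your initial weights, they are still in the range [0, 30]. Hence, in the first iteration, the z vector contains only positive numbers, with values up to around 50. As a result, the a vector (the output of your sigmoid) looks like this:
[0.9994797 0.99853904 0.99358676 0.99999973 0.98392862 0.99983016 0.99818802 ...]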
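To make the saturation concrete, here is a minimal sketch with made-up numbers. The array x, the fixed weight value 0.005 (the middle of the [0, 0.01] range the question starts from) and the value ranges for rows 3 and 23 are assumptions for illustration, not the real dataset:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Hypothetical stand-in for the unscaled inputs: 30 features, 5 samples.
# Most features are small, but rows 3 and 23 hold values in the hundreds or thousands.
x = rng.uniform(0.0, 1.0, size=(30, 5))
x[3] = rng.uniform(500.0, 2500.0, size=5)
x[23] = rng.uniform(500.0, 4000.0, size=5)

# Assumed initial weights: all 0.005, i.e. small and positive like the question's init.
weights = np.full((30, 1), 0.005)

z = weights.T @ x   # shape (1, 5); dominated by rows 3 and 23
a = sigmoid(z)      # pushed very close to 1 for every sample

print("z:", z)
print("a:", a)
```

Even though the weights are tiny, the two large rows alone are enough to push every z well above 5, so the sigmoid output is close to 1 regardless of the label.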
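So in the first step, your model always predicts 1 with high confidence. But that is not always correct, and the high probabilities your model outputs lead to large gradients, which you can see when looking at the highest values of dw. In my case, dw[3] is 388 and dw[23] is 571, while the other values lie in [0, 55]. So you can clearly see how the large inputs in these features lead to exploding gradients. Because gradient descent now takes too large a step in the opposite direction, the weights in the next step are not in [0, 0.01] but in [-285, 0.002], which only makes things worse. In the next iteration, z contains values around -1 million, which leads to the overflow in the sigmoid function.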
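The same chain of events can be reproduced on synthetic data. This is only a sketch: the data, the labels and the single large feature in row 23 are invented, so the exact numbers will differ from the ones quoted above, but the mechanism is the same as in the question's train loop:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
m = 200

# Hypothetical data: 30 features in [0, 1), except row 23 which is unscaled.
x = rng.uniform(0.0, 1.0, size=(30, m))
x[23] = rng.uniform(500.0, 4000.0, size=m)
y = rng.integers(0, 2, size=(1, m))      # made-up binary labels

weights = np.full((30, 1), 0.005)        # assumed small positive initial weights
bias = 0.0
learning_rate = 0.5                      # same default as in the question

# Iteration 1: the model predicts ~1 everywhere, so the error is ~1 for every y == 0.
a = sigmoid(weights.T @ x + bias)
dw = (x @ (a - y).T) / m                 # gradient, shape (30, 1)

# The update for row 23 is enormous compared to all other rows.
new_weights = weights - learning_rate * dw

# Iteration 2: z is now hugely negative, exp(-z) overflows, and the sigmoid returns 0.
z_next = new_weights.T @ x + bias
with np.errstate(over="ignore"):         # silence the overflow warning for the demo
    a_next = sigmoid(z_next)

print("dw[23]:", dw[23, 0], "largest other |dw|:", np.abs(np.delete(dw, 23)).max())
print("new weight 23:", new_weights[23, 0])
print("z_next range:", z_next.min(), z_next.max())
```

Once a_next is exactly 0, np.log(a_next) is -inf, and multiplying that by a label of 0 gives nan, which matches the "divide by zero" and "invalid value" warnings in the output.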
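Solution
- Normalize your inputs to [0, 1]
- Use weights in [-0.01, 0.01] so that they roughly cancel each other out. Otherwise, the values in z still scale linearly with the number of features you have.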
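As a rough illustration of the second point, the sketch below compares positive-only weights with weights drawn symmetrically around zero, on random inputs that are already in [0, 1]. The helper mean_abs_z and the feature counts are hypothetical and only serve the comparison:

```python
import numpy as np

def mean_abs_z(n_features, low, rng, m=1000):
    """Average |z| for random inputs in [0, 1) and weights drawn from [low, 0.01)."""
    x = rng.uniform(0.0, 1.0, size=(n_features, m))
    w = rng.uniform(low, 0.01, size=(n_features, 1))
    return float(np.abs(w.T @ x).mean())

rng = np.random.default_rng(1)

for n in (30, 3000):
    positive = mean_abs_z(n, 0.0, rng)      # weights in [0, 0.01)
    symmetric = mean_abs_z(n, -0.01, rng)   # weights in [-0.01, 0.01)
    print(n, "features: positive-only |z| =", positive, " symmetric |z| =", symmetric)
```

With positive-only weights the average |z| grows roughly in proportion to the number of features, while with symmetric weights the contributions partly cancel. In the question's class this would mean replacing np.random.rand(30, 1) * 0.01 with something like np.random.uniform(-0.01, 0.01, size=(30, 1)).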
As for normalizing the inputs, you can use sklearn's MinMaxScaler:
from sklearn.preprocessing import MinMaxScaler

x_train, x_test, y_train, y_test, m = load_dataset()

# Fit the scaler on the training inputs only
scaler = MinMaxScaler()
x_train_normalized = scaler.fit_transform(x_train.T).T

neuralNet = Neural_Network()

# step-2 : Train the network
neuralNet.train(x_train_normalized, y_train, 10000, m)

# Use the same transformation on the test inputs as on the training inputs
x_test_normalized = scaler.transform(x_test.T).T
y_predicted = neuralNet.predict(x_test_normalized)
The .T is there because sklearn expects the training inputs to have shape (num_samples, num_features), whereas your x_train and x_test have shape (num_features, num_samples).
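If you would rather avoid the transposes, the same min-max scaling can be written directly in NumPy for arrays shaped (num_features, num_samples). This is a sketch under that layout assumption. The helper names fit_min_max and apply_min_max are made up here, and the data is random:

```python
import numpy as np

def fit_min_max(x_train):
    """Per-feature minimum and range for data shaped (num_features, num_samples)."""
    x_min = x_train.min(axis=1, keepdims=True)
    x_range = x_train.max(axis=1, keepdims=True) - x_min
    x_range[x_range == 0] = 1.0   # constant features: avoid dividing by zero
    return x_min, x_range

def apply_min_max(x, x_min, x_range):
    """Scale x with statistics computed on the training set."""
    return (x - x_min) / x_range

# Hypothetical data in the question's layout: 30 features, samples in columns.
rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 4000.0, size=(30, 381))
x_test = rng.uniform(0.0, 4000.0, size=(30, 188))

x_min, x_range = fit_min_max(x_train)
x_train_normalized = apply_min_max(x_train, x_min, x_range)
x_test_normalized = apply_min_max(x_test, x_min, x_range)   # reuse training statistics

print(x_train_normalized.min(), x_train_normalized.max())
```

The test inputs are scaled with the training minimum and range, so individual test values can fall slightly outside [0, 1]. That is expected. When comparing predictions with labels afterwards, flattening both arrays, for example np.mean(y_predicted.ravel() == y_test.ravel()), sidesteps any question of how a metric function interprets inputs of shape (1, n).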