Why do higher learning rates in logistic regression produce NaN costs?
Summary
I am building a spam vs. ham classifier using Octave and the Ling-Spam corpus; my method of classification is logistic regression.
Higher learning rates cause NaN values to be calculated for the cost, yet they do not break/decrease the performance of the classifier itself.
My attempt
NB: My dataset is already normalised using mean normalisation.
When trying to choose my learning rate, I started with 0.1 and 400 iterations. This resulted in the following plot:
1 - Graph 1
The lines completely disappear after a few iterations because NaN values are produced; I thought this would result in broken parameter values and thus bad accuracy, but when checking the accuracy I saw it was 95% on the test set (which means gradient descent was apparently still functioning). I checked different values of learning rate and iterations to see how the graphs changed:
2 - Graph 2
The lines no longer disappear, meaning no NaN values, BUT the accuracy was 87%, which is substantially lower.
I did two more tests with more iterations and a slightly higher learning rate, and in both of them the graphs decreased with the iterations as expected, but the accuracy was ~86-88%. No NaNs there either.
I realised my dataset is skewed, with only 481 spam emails against 2412 ham emails. I therefore calculated the FScore for each of these different combinations, hoping to find that the later ones had a higher FScore and that the 95% accuracy was just down to the skew. That was not the case either. I have summarised my results in a table:
3 - Table
So there is no overfitting, and the skew does not seem to be the problem; I am at a loss now!
The only thing I can think of is that my calculations of accuracy and FScore are wrong, or that my initial debugging of the lines 'disappearing' was wrong.
EDIT: The crux of this question is why the NaN values occur for those chosen learning rates. So my temporary fix of lowering the learning rate does not really answer my question. I had always thought that higher learning rates would simply diverge instead of converge, not that they would produce NaN values.
My code
My main.m code (bar getting the dataset from the file):
numRecords = length(labels);
% 60/20/20 split into training, cross-validation and test sets
trainingSize = ceil(numRecords*0.6);
CVSize = trainingSize + ceil(numRecords*0.2);
featureData = normalise(data);
featureData = [ones(numRecords, 1), featureData];    % prepend bias column
numFeatures = size(featureData, 2);
featuresTrain = featureData(1:(trainingSize-1),:);
featuresCV = featureData(trainingSize:(CVSize-1),:);
featuresTest = featureData(CVSize:numRecords,:);
labelsTrain = labels(1:(trainingSize-1),:);
labelsCV = labels(trainingSize:(CVSize-1),:);
labelsTest = labels(CVSize:numRecords,:);
% gradient descent settings (learning rate lowered as the temporary workaround)
paramStart = zeros(numFeatures, 1);
learningRate = 0.0001;
iterations = 400;
[params] = gradDescent(featuresTrain, labelsTrain, learningRate, iterations, paramStart, featuresCV, labelsCV);
% evaluate on the test set
threshold = 0.5;
[accuracy, precision, recall] = predict(featuresTest, labelsTest, params, threshold);
fScore = (2*precision*recall)/(precision+recall);
My gradDescent.m code:
function [optimParams] = gradDescent(features, labels, learningRate, iterations, paramStart, featuresCV, labelsCV)
% histories of the training and cross-validation cost, for plotting
x_axis = [];
J_axis = [];
J_CV = [];
params = paramStart;
for i=1:iterations,
% compute cost (and gradient) on the training set, and cost on the CV set
[cost, grad] = costFunction(features, labels, params);
[cost_CV] = costFunction(featuresCV, labelsCV, params);
% one gradient descent step
params = params - (learningRate.*grad);
x_axis = [x_axis;i];
J_axis = [J_axis;cost];
J_CV = [J_CV;cost_CV];
endfor
graphics_toolkit("gnuplot")
plot(x_axis, J_axis, 'r', x_axis, J_CV, 'b');
legend("Training", "Cross-Validation");
xlabel("Iterations");
ylabel("Cost");
title("Cost as a function of iterations");
optimParams = params;
endfunction
My costFunction.m code:
function [cost, grad] = costFunction(features, labels, params)
numRecords = length(labels);
hypothesis = sigmoid(features*params);
% cross-entropy cost and its gradient for logistic regression
cost = (-1/numRecords)*sum((labels).*log(hypothesis)+(1-labels).*log(1-hypothesis));
grad = (1/numRecords)*(features'*(hypothesis-labels));
endfunction
My predict.m code:
function [accuracy, precision, recall] = predict(features, labels, params, threshold)
numRecords=length(labels);
% classify as spam when the sigmoid output exceeds the threshold
predictions = sigmoid(features*params)>threshold;
correct = predictions == labels;
truePositives = sum(predictions == labels == 1);
falsePositives = sum((predictions == 1) != labels);
falseNegatives = sum((predictions == 0) != labels);
precision = truePositives/(truePositives+falsePositives);
recall = truePositives/(truePositives+falseNegatives);
accuracy = 100*(sum(correct)/numRecords);
endfunction
Credit where credit is due:
The answer linked to here was of great help: this question is somewhat of a duplicate, but I had not realised it, and it is not obvious at first... I will do my best to explain why the solution works too, rather than simply copying the answer.
Solution:
The problem was actually the 0*log(0) = NaN result occurring in my data. To fix it, the cost calculation became:
cost = (-1/numRecords)*sum((labels).*log(hypothesis)+(1-labels).*log(1-hypothesis+eps(numRecords, 1)));
(See the question for the variable values etc.; including the rest seemed redundant when only this line changes.)
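To see where the NaN comes from in the first place, here is a minimal standalone Octave sketch (illustrative only, not part of my pipeline, with a made-up activation value): with a large learning rate the parameters grow quickly, features*params becomes large, sigmoid() saturates to exactly 1.0 in double precision, and the (1-labels).*log(1-hypothesis) term becomes 0*log(0) = 0*(-Inf) = NaN.
z = 50;                      % hypothetical large activation after a few big updates
h = 1 ./ (1 + exp(-z));      % sigmoid saturates to exactly 1 in double precision
disp(h == 1)                 % 1 (true)
disp(log(1 - h))             % -Inf
disp(0 * log(1 - h))         % NaN -- this is the 0*log(0) term in the cost
disp(0 * log(1 - h + eps))   % 0 -- the eps shift keeps log() finite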
Explanation:
The eps() function is documented as follows:
Return a scalar, matrix or N-dimensional array whose elements are all eps, the machine precision.
More precisely, eps is the relative spacing between any two adjacent numbers in the machine’s floating point system. This number is obviously system dependent. On machines that support IEEE floating point arithmetic, eps is approximately 2.2204e-16 for double precision and 1.1921e-07 for single precision.
When called with more than one argument the first two arguments are taken as the number of rows and columns and any further arguments specify additional matrix dimensions. The optional argument class specifies the return type and may be either "double" or "single".
So this means that adding this value to the value calculated by the sigmoid function (which was previously so close to 0 that it was taken as 0) makes it the closest value to 0 without being 0, so that log() no longer returns -Inf.
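As a quick sanity check of that quote (a standalone sketch, assuming a standard double-precision Octave build); note that in the fixed cost line, eps(numRecords, 1) builds a numRecords-by-1 column of these values, matching the shape of 1-hypothesis:
disp(eps)                % 2.2204e-16, the scalar machine epsilon for doubles
disp(size(eps(3, 1)))    % 3 1 -- eps(m, n) returns an m-by-n matrix of eps values
disp(log(0))             % -Inf
disp(log(0 + eps))       % about -36.04, finite, so the cost stays a real number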
Testing again with a learning rate of 0.1 and 2000/1000/400 iterations, the full graphs were plotted and no NaN values were produced when checking.
NB: In case anyone was wondering, the accuracy and FScores did not change after this, so the accuracy really was that good despite the error in computing the cost at the higher learning rates.
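For completeness, the reason the accuracy survives the NaN cost is that the NaN only ever appears in the cost value itself: the gradient in costFunction.m never touches log(), so it stays finite and the parameter updates keep working. A minimal sketch with a hypothetical three-record toy set (made-up values, reusing the formulas from costFunction.m):
labels   = [1; 0; 0];                   % hypothetical toy labels
features = [1 60; 1 -60; 1 2];          % bias column plus one feature
params   = [0; 1];                      % parameters after a few large updates
hypothesis = 1 ./ (1 + exp(-features*params));    % approx [1; 0; 0.88]
cost = (-1/3)*sum(labels.*log(hypothesis)+(1-labels).*log(1-hypothesis))
% cost is NaN: the first record contributes 0*log(0)
grad = (1/3)*(features'*(hypothesis-labels))
% grad is finite (approx [0.29; 0.59]), so gradient descent still makes sensible updates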