Stacked sigmoids: why does training the second layer alter the first layer?
I am training a neural network in which sigmoid layers are stacked on top of each other. I have labels associated with each layer, and I would like to alternate between training to minimize the loss of the first layer and training to minimize the loss of the second layer. I expected that the results I get on the first layer would not change whether or not I also train the second layer. However, I see a significant difference. What am I missing?
Here is the code:
import numpy as np
import cntk as C

dim = Xtrain.shape[1]
output_dim = Ytrain.shape[1]
categories_dim = Ctrain.shape[1]
features = C.input_variable(dim, np.float32)
label = C.input_variable(output_dim, np.float32)
categories = C.input_variable(categories_dim, np.float32)
b = C.parameter(shape=(output_dim))
w = C.parameter(shape=(dim, output_dim))
adv_w = C.parameter(shape=(output_dim, categories_dim))
adv_b = C.parameter(shape=(categories_dim))
pred_parameters = (w, b)
adv_parameters = (adv_w, adv_b)
z = C.tanh(C.times(features, w) + b)
adverse = C.tanh(C.times(z, adv_w) + adv_b)
pred_loss = C.cross_entropy_with_softmax(z, label)
pred_error = C.classification_error(z, label)
adv_loss = C.cross_entropy_with_softmax(adverse, categories)
adv_error = C.classification_error(adverse, categories)
pred_learning_rate = 0.5
pred_lr_schedule = C.learning_rate_schedule(pred_learning_rate, C.UnitType.minibatch)
pred_learner = C.adam(pred_parameters, pred_lr_schedule, C.momentum_as_time_constant_schedule(0.9))
pred_trainer = C.Trainer(adverse, (pred_loss, pred_error), [pred_learner])
adv_learning_rate = 0.5
adv_lr_schedule = C.learning_rate_schedule(adv_learning_rate, C.UnitType.minibatch)
adv_learner = C.adam(adverse.parameters, adv_lr_schedule, C.momentum_as_time_constant_schedule(0.9))
adv_trainer = C.Trainer(adverse, (adv_loss, adv_error), [adv_learner])
minibatch_size = 50
num_of_epocs = 40
# Run the trainer and perform model training
training_progress_output_freq = 50
def permute(x, y, c):
    # shuffle all three arrays with the same random permutation
    rr = np.arange(x.shape[0])
    np.random.shuffle(rr)
    x = x[rr, :]
    y = y[rr, :]
    c = c[rr, :]
    return (x, y, c)

for e in range(0, num_of_epocs):
    (x, y, c) = permute(Xtrain, Ytrain, Ctrain)
    for i in range(0, x.shape[0], minibatch_size):
        m_features = x[i:min(i+minibatch_size, x.shape[0]), ]
        m_labels = y[i:min(i+minibatch_size, x.shape[0]), ]
        m_cat = c[i:min(i+minibatch_size, x.shape[0]), ]
        # alternate: even epochs train the prediction layer, odd epochs the adversarial layer
        if (e % 2 == 0):
            pred_trainer.train_minibatch({features : m_features, label : m_labels, categories : m_cat, diagonal : m_diagonal})
        else:
            adv_trainer.train_minibatch({features : m_features, label : m_labels, categories : m_cat, diagonal : m_diagonal})
To my surprise, the train and test errors of z in predicting the labels change if I comment out the last two lines (the else: adv_trainer.train_minibatch(...)). Since adv_trainer is supposed to modify only adv_w and adv_b, neither of which is used to compute z or its loss, I don't understand why this happens. I appreciate the help.
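To confirm that the first layer really moves, I can snapshot w and b around a single adv_trainer update and check the change. A minimal diagnostic sketch, assuming CNTK 2.x and reusing one minibatch (m_features, m_cat) from the loop above:

# copy the first-layer parameters before one adversarial step
w_before = w.value.copy()
b_before = b.value.copy()

# run a single minibatch through the adversarial trainer only
adv_trainer.train_minibatch({features : m_features, categories : m_cat})

# if the learner were restricted to adv_w and adv_b, both deltas would be exactly zero
print("max |delta w|:", np.max(np.abs(w.value - w_before)))
print("max |delta b|:", np.max(np.abs(b.value - b_before)))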
You should not do this:
adv_learner = C.adam(adverse.parameters, adv_lr_schedule, C.momentum_as_time_constant_schedule(0.9))
but rather:
adv_learner = C.adam(adv_parameters, adv_lr_schedule, C.momentum_schedule(0.9))
adverse.parameters contains all of the parameters, which is not what you want here. Separately, you will want to replace momentum_as_time_constant_schedule with momentum_schedule. The former takes as its argument the number of samples after which the contribution of a gradient will have decayed by a factor of exp(-1).
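Put together, a minimal sketch of the corrected setup, assuming CNTK 2.x and the pred_parameters / adv_parameters tuples already defined in the question:

# each learner now sees only the parameters of its own layer
pred_learner = C.adam(pred_parameters, pred_lr_schedule, C.momentum_schedule(0.9))
pred_trainer = C.Trainer(adverse, (pred_loss, pred_error), [pred_learner])

adv_learner = C.adam(adv_parameters, adv_lr_schedule, C.momentum_schedule(0.9))
adv_trainer = C.Trainer(adverse, (adv_loss, adv_error), [adv_learner])

# note on the two schedules: momentum_schedule(0.9) uses 0.9 directly as the
# momentum coefficient, while momentum_as_time_constant_schedule(T) corresponds
# to a per-sample momentum of exp(-1/T); T = 0.9 gives exp(-1/0.9) ≈ 0.33,
# far from the 0.9 that was presumably intended.

With the learners built this way, an adv_trainer.train_minibatch call should leave w and b untouched, so the deltas in the diagnostic above should come out exactly zero.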