Grouped depthwise convolution performance
I'm trying to improve the performance of my ResNeXt implementation in TensorFlow. David Berthelot mentioned a potential improvement on Twitter, and I'd like to apply it to my implementation - how does reshape+sum fit into it?
# one resnext block per figure 3c
# see also https://arxiv.org/pdf/1611.05431.pdf
# (is_training and cardinality are defined at module scope)
def bottleneck(x, strides, dim):
    # 1x1 projection down to the bottleneck width
    x = tf.layers.conv2d(x, filters=64, kernel_size=1, strides=strides)
    x = tf.layers.batch_normalization(x, training=is_training)
    x = tf.nn.relu(x)
    # 3x3 depthwise convolution in place of the grouped convolution
    w = tf.get_variable(name='depthwise_filter', shape=[3, 3, 64, cardinality])
    x = tf.nn.depthwise_conv2d_native(x, w, strides=[1, 1, 1, 1], padding='SAME')
    x = tf.layers.batch_normalization(x, training=is_training)
    x = tf.nn.relu(x)
    # 1x1 expansion back to the block's output width
    x = tf.layers.conv2d(x, filters=dim, kernel_size=1, strides=1)
    x = tf.layers.batch_normalization(x, training=is_training)
    return tf.nn.relu(x)
Edit: I had thought this implementation was correct and that I only needed to add a couple of ops to improve performance. Re-reading David's comment, the depthwise+reshape+sum is not a replacement for a single depthwise op but for the other approaches; the code above does not compute an equivalent of the 3d version of the bottleneck block.
Depthwise convolutions and grouped convolutions are closely related. A grouped convolution applies a set of independent kernels across groups of channels, while a depthwise convolution applies a set of independent kernels to each input channel. Crucially, in both cases each individual connection between an input channel and an output channel uses weights that are not shared with any other input-output channel pair. As a result, we can (as the man said!) apply a reshape and a sum to emulate a grouped convolution with a depthwise convolution. The approach costs memory, because we have to allocate a tensor several times larger to hold the intermediate computation.
A depthwise convolution maps a single input channel to several output channels, while a grouped convolution maps blocks of input channels to blocks of output channels. If we want to apply a grouped convolution with 32 groups to a 128-channel input, we can instead apply a depthwise convolution with a channel multiplier of 128/32 = 4. The output tensor is a disassembled version of the equivalent grouped-convolution output - the first 16 channels of the depthwise output correspond to the first 4 channels of the grouped-convolution output. We can reshape those channels into a set of 4x4 blocks and sum along one of the new axes to reproduce the grouped-convolution output. Across all output channels, we simply reshape by adding two new axes of size 4, sum, and reshape back to 128 channels; a sanity check of this equivalence is sketched after the code below.
# one resnext block per figure 3c
# see also https://arxiv.org/pdf/1611.05431.pdf
def bottleneck(x, strides, dim, is_training):
    input_channels = x.shape.as_list()[-1]
    bottleneck_depth = input_channels // 2
    # 1x1 projection down to the bottleneck width
    x = tf.layers.conv2d(x, filters=bottleneck_depth, kernel_size=1, strides=strides)
    x = tf.layers.batch_normalization(x, training=is_training)
    x = tf.nn.relu(x)
    # depthwise convolution with channel multiplier equal to the group size
    group_size = bottleneck_depth // cardinality
    w = tf.get_variable(name='depthwise_filter', shape=[3, 3, bottleneck_depth, group_size])
    x = tf.nn.depthwise_conv2d_native(x, w, strides=[1, 1, 1, 1], padding='SAME')
    # reshape + sum collapses the depthwise output to the grouped-convolution output
    # (-1 keeps the batch dimension, which may be unknown)
    depthwise_shape = x.shape.as_list()
    x = tf.reshape(x, [-1] + depthwise_shape[1:3] + [cardinality, group_size, group_size])
    x = tf.reduce_sum(x, axis=4)
    x = tf.reshape(x, [-1] + depthwise_shape[1:3] + [bottleneck_depth])
    x = tf.layers.batch_normalization(x, training=is_training)
    x = tf.nn.relu(x)
    # 1x1 expansion back to the block's output width
    x = tf.layers.conv2d(x, filters=dim, kernel_size=1, strides=1)
    x = tf.layers.batch_normalization(x, training=is_training)
    return tf.nn.relu(x)
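As a sanity check, the reshape+sum can be compared against an explicit per-group convolution built with tf.split and tf.nn.conv2d. A minimal sketch, assuming plain TF 1.x and illustrative sizes (channels, cardinality and the other names below are mine, not part of the block above):

import numpy as np
import tensorflow as tf

channels, cardinality = 8, 2
group_size = channels // cardinality  # also the depthwise channel multiplier

x = tf.constant(np.random.rand(1, 5, 5, channels), dtype=tf.float32)
w = tf.constant(np.random.rand(3, 3, channels, group_size), dtype=tf.float32)

# explicit grouped convolution: split channels, convolve each group, concatenate
x_groups = tf.split(x, cardinality, axis=3)
w_groups = tf.split(w, cardinality, axis=2)  # each piece is [3, 3, group_size, group_size]
grouped = tf.concat(
    [tf.nn.conv2d(xg, wg, strides=[1, 1, 1, 1], padding='SAME')
     for xg, wg in zip(x_groups, w_groups)], axis=3)

# depthwise convolution followed by reshape + sum, as in the block above
d = tf.nn.depthwise_conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')
d = tf.reshape(d, [1, 5, 5, cardinality, group_size, group_size])
d = tf.reduce_sum(d, axis=4)
d = tf.reshape(d, [1, 5, 5, channels])

with tf.Session() as sess:
    grouped_out, emulated_out = sess.run([grouped, d])
    print(np.allclose(grouped_out, emulated_out, atol=1e-5))  # expect True

If the channel bookkeeping is right, the two outputs should agree up to floating-point tolerance.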
Edit: It seems I hadn't formulated the reshape/sum correctly. I've updated the code sample above to reflect what I now believe is the correct transformation. The old version reduces to a depthwise convolution with a channel_multiplier of 1.
To get a better feel for the difference, I'll illustrate the incorrect and the correct behavior in numpy with the weights fixed at 1. We'll look at a simpler 8-channel input with two groups.
import numpy as np

input = np.arange(8)
# => [0, 1, 2, 3, 4, 5, 6, 7]
# the result of applying a depthwise convolution with a channel multiplier of 4 and weights fixed at 1
depthwise_output = np.repeat(input, 4)
# => [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, ..., 6, 6, 7, 7, 7, 7]
The incorrect transformation:
x = depthwise_output.reshape((8, 4))
# => [[0, 0, 0, 0],
#     [1, 1, 1, 1],
#     [2, 2, 2, 2],
#     [3, 3, 3, 3],
#     [4, 4, 4, 4],
#     [5, 5, 5, 5],
#     [6, 6, 6, 6],
#     [7, 7, 7, 7]]
x = x.sum(axis=1)
# => [ 0, 4, 8, 12, 16, 20, 24, 28]
The correct transformation:
x = depthwise_output.reshape((2, 4, 4))
# => [[[0, 0, 0, 0],
#      [1, 1, 1, 1],
#      [2, 2, 2, 2],
#      [3, 3, 3, 3]],
#
#     [[4, 4, 4, 4],
#      [5, 5, 5, 5],
#      [6, 6, 6, 6],
#      [7, 7, 7, 7]]]
x = x.sum(axis=1)
# => [[ 6,  6,  6,  6],
#     [22, 22, 22, 22]]
x = x.reshape((8,))
# => [ 6,  6,  6,  6, 22, 22, 22, 22]
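For reference, the correct transform generalizes to arbitrary channel counts and cardinality. A small numpy sketch (emulate_grouped_sum is just an illustrative name, not from the answer above):

import numpy as np

def emulate_grouped_sum(depthwise_output, cardinality):
    # depthwise_output has channels * multiplier values per position,
    # with the multiplier equal to the group size
    n = depthwise_output.shape[-1]
    group_size = int(round((n / cardinality) ** 0.5))
    x = depthwise_output.reshape(
        depthwise_output.shape[:-1] + (cardinality, group_size, group_size))
    # sum over the input-channel axis within each group, keep the output axis
    x = x.sum(axis=-2)
    return x.reshape(depthwise_output.shape[:-1] + (cardinality * group_size,))

print(emulate_grouped_sum(np.repeat(np.arange(8), 4), 2))
# => [ 6  6  6  6 22 22 22 22]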
Here is how I implemented it:
class LayerCardinalConv(object):
    """Aggregated Residual Transformations for Deep Neural Networks https://arxiv.org/abs/1611.05431"""

    def __init__(self, name, w, nin, card, use_bias=True, init='he'):
        self.group = nin // card
        with tf.name_scope(name):
            # depthwise filter with channel multiplier equal to the group size
            self.conv = tf.Variable(weight_init(nin, self.group, [*w, nin, self.group], init), name='conv')
            self.bias = tf.Variable(tf.zeros([nin]), name='bias') if use_bias else 0

    def __call__(self, vin, train):
        s = tf.shape(vin)
        vout = tf.nn.depthwise_conv2d(vin, self.conv, strides=[1] * 4, padding='SAME')
        # collapse the multiplied channels back to nin outputs with a reshape + sum
        vout = tf.reshape(vout, [s[0], s[1], s[2], self.group, s[3]])
        vout = tf.reduce_sum(vout, 3)
        return vout + self.bias
Notes:
- w is the kernel shape, e.g. (3, 3)
- nin is the number of input channels
- card is the cardinality, i.e. the number of groups
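A quick usage sketch (the weight_init below is only a stand-in, since the helper used in the class above is not shown in this answer):

import tensorflow as tf

def weight_init(nin, nout, shape, init='he'):
    # He-style normal initialization as a stand-in for the original helper
    fan_in = shape[0] * shape[1] * nin
    return tf.random_normal(shape, stddev=(2.0 / fan_in) ** 0.5)

x = tf.placeholder(tf.float32, [None, 32, 32, 64])
layer = LayerCardinalConv('cardinal_conv', w=(3, 3), nin=64, card=32)
y = layer(x, train=True)  # same spatial size, 64 output channels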
Hope this helps.