Feature extraction using a Caffe model

Question

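I would like to discuss feature extraction using the Caffe model called GoogLeNet. I am referring to the paper "End-to-end people detection in crowded scenes". Anyone familiar with Caffe should be able to help with my questions.

The paper comes with its own Python library. I have run the library successfully, but I cannot follow some of the points made in the paper.

The input image is passed through GoogLeNet up to the inception_5b/output layer.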

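The output is then a 15x20x1024 multidimensional array, so each 1024-element vector represents a bounding box at the center of a 64x64 region. Since the regions overlap by 50%, a 640x480 image gives a 15x20 grid, and each cell holds a 1024-element vector along the third dimension.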
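The grid arithmetic described above can be sketched as follows. This is a minimal illustration, not code from the paper's library: the helper names `grid_shape` and `cell_region` are hypothetical, and the placement of each cell's center at `(col + 0.5) * stride` is an assumption made only to show how 64x64 regions with 50% overlap tile a 640x480 image.

```python
# Values taken from the question: 640x480 input, 64x64 regions, 50% overlap.
IMAGE_W, IMAGE_H = 640, 480
REGION = 64
STRIDE = REGION // 2  # 50% overlap means neighbouring regions are 32 px apart


def grid_shape(image_w, image_h, stride):
    """Return (rows, cols) of the feature grid for a given stride."""
    return image_h // stride, image_w // stride


def cell_region(row, col, stride=STRIDE, region=REGION):
    """Return (x_min, y_min, x_max, y_max) of the region centred on a cell.

    Assumption: cell centres sit at (col + 0.5) * stride; the real library
    may use a different offset.
    """
    cx = (col + 0.5) * stride
    cy = (row + 0.5) * stride
    half = region / 2
    return cx - half, cy - half, cx + half, cy + half


rows, cols = grid_shape(IMAGE_W, IMAGE_H, STRIDE)
print(rows, cols)          # 15 20
print(cell_region(0, 0))   # (-16.0, -16.0, 48.0, 48.0)
print(cell_region(0, 1))   # (16.0, -16.0, 80.0, 48.0)
```

Under these assumptions, horizontally adjacent regions share 32 of their 64 columns, which is the 50% overlap mentioned in the question.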

My questions are:

(1) How is this 15x20x1024 array output obtained?

(2) How does this 1x1x1024 data represent a 64x64 region in the image? The source code describes it as

"""Takes the output from the decapitated googlenet and transforms the output
    from a NxCxWxH to (NxWxH)xCx1x1 that is used as input for the lstm layers.
    N = batch size, C = channels, W = grid width, H = grid height."""

The transformation is implemented in Python by this function:

def generate_intermediate_layers(net):
    """Takes the output from the decapitated googlenet and transforms the output
    from a NxCxWxH to (NxWxH)xCx1x1 that is used as input for the lstm layers.
    N = batch size, C = channels, W = grid width, H = grid height."""

    net.f(Convolution("post_fc7_conv", bottoms=["inception_5b/output"],
                      param_lr_mults=[1., 2.], param_decay_mults=[0., 0.],
                      num_output=1024, kernel_dim=(1, 1),
                      weight_filler=Filler("gaussian", 0.005),
                      bias_filler=Filler("constant", 0.)))
    net.f(Power("lstm_fc7_conv", scale=0.01, bottoms=["post_fc7_conv"]))
    net.f(Transpose("lstm_input", bottoms=["lstm_fc7_conv"]))
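
The shape change described in the docstring can be imitated in NumPy as a rough sketch. This is not the library's Transpose layer: the function name `to_lstm_input` is hypothetical, the axis order (N, C, H, W) follows the usual Caffe blob layout, and the order in which grid cells are flattened is an assumption.

```python
import numpy as np


def to_lstm_input(features):
    """Reshape (N, C, H, W) features into (N*H*W, C, 1, 1)."""
    n, c, h, w = features.shape
    # Move channels last so each grid cell's C values stay together,
    # then flatten batch and grid positions into one leading axis.
    cells = features.transpose(0, 2, 3, 1).reshape(n * h * w, c)
    return cells.reshape(n * h * w, c, 1, 1)


# Stand-in for the inception_5b/output blob of one 640x480 image.
features = np.random.rand(1, 1024, 15, 20).astype(np.float32)
lstm_input = to_lstm_input(features)
print(lstm_input.shape)  # (300, 1024, 1, 1)

# With this flattening order, row r * 20 + c holds the vector of grid cell (r, c).
assert np.array_equal(lstm_input[2 * 20 + 5, :, 0, 0], features[0, :, 2, 5])
```

Each of the 300 rows is one grid cell's 1024-element vector, which is the form the docstring says the LSTM layers take as input.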

I cannot work out this part: how does each 1x1x1024 vector represent the size of a bounding-box rectangle?

Answer 1

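Since you are looking at a 1x1 cell very deep in the network, its effective receptive field is quite large and can be (and probably is) 64x64 pixels in the original image.
That is, each feature in "inception_5b/output" is influenced by 64x64 pixels of the input image.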
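
To make the idea of a receptive field concrete, here is a small sketch of the standard recurrence for a stack of convolution or pooling layers. The layer list is a hypothetical toy stack chosen for illustration, not the actual GoogLeNet configuration, and the function name `receptive_field` is made up for this example.

```python
def receptive_field(layers):
    """Return (receptive_field, stride) for a stack of (kernel, stride) layers."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) input-space steps
        jump *= stride             # strided layers spread output cells further apart
    return rf, jump


# Hypothetical toy stack: five 3x3 layers with stride 2. NOT the real GoogLeNet.
toy_stack = [(3, 2)] * 5
print(receptive_field(toy_stack))  # (63, 32)
```

Even this toy stack gives a deep 1x1 cell a field of about 64 pixels, with neighbouring cells 32 pixels apart in the input, which is how a single cell can summarize a large image region.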


computer-vision

neural-network

deep-learning

caffe