How to correctly parse the MNIST dataset (idx format) into Python arrays?

Question

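I'm new to machine learning, and I'm trying to avoid downloading the MNIST dataset from the openml module every time I need to work with it. I found this code online that converts the idx files to Python arrays, but there is a problem with my train_set labels: they are always 8 values short, and I believe it has to do with how I convert them.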

import numpy as np
import struct

with open('train-images.idx3-ubyte', 'rb') as f:
    magic, size = struct.unpack('>II', f.read(8))
    nrows, ncols = struct.unpack('>II', f.read(8))
    data = np.fromfile(f, dtype=np.dtype(np.uint8)).newbyteorder(">")
    data = data.reshape((size,nrows,ncols))

with open('train-labels.idx1-ubyte', 'rb') as i:
    magic, size = struct.unpack('>II', i.read(8))
    nrows, ncols = struct.unpack('>II', i.read(8))
    data_1 = np.fromfile(i, dtype=np.dtype(np.uint8)).newbyteorder(">")    
    
x_train, y_train = data, data_1
len(x_train), len(y_train)

>>> (60000,59992)

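As the code above shows, this leaves my labels misaligned, because not every training image gets linked to the correct label. I have re-downloaded the files several times to make sure I didn't get a corrupted one. Please, I need help. Thanks.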

Answer 1

Check the documentation:

TRAINING SET LABEL FILE (train-labels-idx1-ubyte):
[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000801(2049) magic number (MSB first)
0004     32 bit integer  60000            number of items
0008     unsigned byte   ??               label
0009     unsigned byte   ??               label
........
xxxx     unsigned byte   ??               label
The labels values are 0 to 9.

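The first 4 bytes are the magic number and the next 4 bytes are the number of items. The labels start right after that, so you have to skip 8 bytes to reach them. You skipped 16 bytes instead, which dropped the first 8 labels.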
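To see why the count comes out exactly 8 short, here is a minimal sketch that builds a tiny label file in memory instead of reading the real one. The helper name `read_labels` and the 20-label payload are made up for illustration; only the 8-byte header layout comes from the documentation quoted above.

```python
import struct

import numpy as np

# Synthetic label file: 8-byte header (magic 2049, item count) + 20 labels.
labels = np.arange(20, dtype=np.uint8) % 10
blob = struct.pack('>II', 2049, len(labels)) + labels.tobytes()


def read_labels(raw, header_bytes):
    """Skip header_bytes, then treat every remaining byte as one label."""
    return np.frombuffer(raw, dtype=np.uint8, offset=header_bytes)


correct = read_labels(blob, 8)    # header size for a label file
too_far = read_labels(blob, 16)   # header size for an image file

print(len(correct), len(too_far))  # prints: 20 12
```

Each label is a single byte, so reading 8 bytes too many as header removes exactly 8 labels, which matches 60000 versus 59992 in the question.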

Fix

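with open('train-labels.idx1-ubyte', 'rb') as i:
    # The label file header is only 8 bytes: magic number, then item count.
    magic, size = struct.unpack('>II', i.read(8))
    # Everything after the header is one unsigned byte per label.
    # Single bytes have no byte order, so no newbyteorder call is needed.
    data_1 = np.fromfile(i, dtype=np.uint8)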
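If you would rather not hard-code the header size per file, the magic number itself says how many dimensions follow. As I understand the IDX format description, its third byte is a data type code and its fourth byte is the number of dimensions (1 for labels, 3 for images). The sketch below relies on that; the names `parse_idx` and `read_idx` are hypothetical helpers, not part of any library.

```python
import struct

import numpy as np

# Type codes as given in the IDX format description (assumed here);
# multi-byte types are stored big-endian.
_IDX_DTYPES = {
    0x08: np.dtype('u1'),
    0x09: np.dtype('i1'),
    0x0B: np.dtype('>i2'),
    0x0C: np.dtype('>i4'),
    0x0D: np.dtype('>f4'),
    0x0E: np.dtype('>f8'),
}


def parse_idx(raw):
    """Parse the bytes of an IDX file into a NumPy array."""
    # Magic number: two zero bytes, a type code, the number of dimensions.
    zeros, type_code, ndim = struct.unpack_from('>HBB', raw, 0)
    if zeros != 0 or type_code not in _IDX_DTYPES:
        raise ValueError('not an IDX file')
    # One big-endian 32-bit size per dimension follows the magic number.
    shape = struct.unpack_from('>' + 'I' * ndim, raw, 4)
    offset = 4 + 4 * ndim  # 8 bytes for labels, 16 bytes for images
    data = np.frombuffer(raw, dtype=_IDX_DTYPES[type_code], offset=offset)
    return data.reshape(shape)  # raises if the payload size is wrong


def read_idx(path):
    """Read an IDX file from disk and return its contents as an array."""
    with open(path, 'rb') as f:
        return parse_idx(f.read())
```

With the file names from the question, `x_train = read_idx('train-images.idx3-ubyte')` and `y_train = read_idx('train-labels.idx1-ubyte')` should then have the same length, because the header size is derived from each file rather than assumed.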
