Load the MNIST digit recognition dataset with R and see any results

Question

The first task in the book "Machine Learning: A Probabilistic Perspective" by Kevin P. Murphy is:

Exercise 1.1 KNN classifier on shuffled MNIST data

Run mnist1NNdemo and verify that the misclassification rate (on the first 1000 test cases) of MNIST of a 1-NN classifier is 3.8%. (If you run it on all 10,000 test cases, the error rate is 3.09%.) Modify the code so that you first randomly permute the features (columns of the training and test design matrices), as in shuffledDigitsDemo, and then apply the classifier. Verify that the error rate is not changed.

My simple understanding is that the exercise asks me to find the 1-NN classifier (kNN() in R) after loading the files.

The files:

train-images-idx3-ubyte.gz: training set images (9912422 bytes)

train-labels-idx1-ubyte.gz: training set labels (28881 bytes)

t10k-images-idx3-ubyte.gz: test set images (1648877 bytes)

t10k-labels-idx1-ubyte.gz: test set labels (4542 bytes)

taken from The MNIST DATABASE

I found a popular template for loading the files:

# for the kNN() function 
library(VIM)
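load_mnist <- function() {
  # Read an IDX3 image file: four big-endian 32-bit header fields,
  # followed by one unsigned byte per pixel
  load_image_file <- function(filename) {
    ret <- list()
    f <- file(filename, 'rb')
    readBin(f, 'integer', n = 1, size = 4, endian = 'big')           # magic number (discarded)
    ret$n <- readBin(f, 'integer', n = 1, size = 4, endian = 'big')  # number of images
    nrow <- readBin(f, 'integer', n = 1, size = 4, endian = 'big')   # rows per image
    ncol <- readBin(f, 'integer', n = 1, size = 4, endian = 'big')   # columns per image
    x <- readBin(f, 'integer', n = ret$n * nrow * ncol, size = 1, signed = FALSE)
    ret$x <- matrix(x, ncol = nrow * ncol, byrow = TRUE)             # one image per row
    close(f)
    ret
  }
  # Read an IDX1 label file: magic number, item count,
  # followed by one unsigned byte per label
  load_label_file <- function(filename) {
    f <- file(filename, 'rb')
    readBin(f, 'integer', n = 1, size = 4, endian = 'big')           # magic number (discarded)
    n <- readBin(f, 'integer', n = 1, size = 4, endian = 'big')      # number of labels
    y <- readBin(f, 'integer', n = n, size = 1, signed = FALSE)
    close(f)
    y
  }
  # <<- assigns in the global environment, so train and test exist
  # only after load_mnist() has been called
  train <<- load_image_file("train-images.idx3-ubyte")
  test <<- load_image_file("t10k-images.idx3-ubyte")

  train$y <<- load_label_file("train-labels.idx1-ubyte")
  test$y <<- load_label_file("t10k-labels.idx1-ubyte")
}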

show_digit <- function(arr784, col=gray(12:1/12)) {
  image(matrix(arr784, nrow=28)[,28:1], col=col)
}

According to the comments, this should work from the command line:

  # Error "Error in matrix(arr784, nrow = 28) : object 'train' not found"
  show_digit(train$x[5,])

The question is: how do I use the show_digit function?

Edit: removed the redundant questions.

Answer 1

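My understanding of the question:

First run the whole file in RStudio or ESS, then call load_mnist() from the console. The error occurs because train does not exist until load_mnist() has been called. After that, running show_digit(train$x[3,]) in the console will work.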
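As a rough sketch of those steps, the snippet below repeats show_digit from the question (with `...` added so that a title can be passed through to image()) and adds a hypothetical helper, show_labeled_digit, which plots one row of the image matrix together with its label. It assumes load_mnist() has already been run; the guarded block at the end is simply skipped otherwise.

```r
# show_digit as in the question, with `...` added (an assumption made here)
# so that extra arguments such as main= reach image()
show_digit <- function(arr784, col = gray(12:1/12), ...) {
  image(matrix(arr784, nrow = 28)[, 28:1], col = col, ...)
}

# Hypothetical helper: plot row i of a 784-column image matrix and put
# its label in the title; returns the label invisibly
show_labeled_digit <- function(x, y, i) {
  stopifnot(ncol(x) == 784, i >= 1, i <= nrow(x))
  show_digit(x[i, ], main = paste("label:", y[i]))
  invisible(y[i])
}

# Only runs once load_mnist() has created the global list `train`
if (exists("train") && is.list(train)) {
  print(dim(train$x))      # number of images by 784 pixels
  print(table(train$y))    # how many examples of each digit
  show_labeled_digit(train$x, train$y, 5)
}
```

Checking dim(train$x) and table(train$y) before plotting is a quick way to confirm that the files were read as expected.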

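A KNN classifier can be run on the whole dataset with a <- knn(train$x, test$x, factor(train$y)), using knn() from the class package (VIM's kNN() performs imputation, not classification). Note that knn() needs the image matrices train$x and test$x, not the train and test lists themselves, and that this will be a very slow process.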
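To keep the run time manageable, the exercise only uses the first 1000 test cases. The following is a minimal sketch of that, plus the feature-shuffling step, assuming the class package is installed and that load_mnist() has created train and test. The helper names knn1_error and permute_features are made up for illustration. Because Euclidean distance is a sum over features, applying the same column permutation to both matrices should leave the nearest neighbours, and therefore the error rate, unchanged (apart from knn()'s random tie-breaking).

```r
library(class)  # provides knn()

# Hypothetical helper: misclassification rate of a 1-NN classifier
knn1_error <- function(train_x, train_y, test_x, test_y) {
  pred <- knn(train_x, test_x, cl = factor(train_y), k = 1)
  mean(as.character(pred) != as.character(test_y))
}

# Hypothetical helper: apply the SAME random column permutation
# to the training and test design matrices
permute_features <- function(train_x, test_x, perm = sample(ncol(train_x))) {
  list(train = train_x[, perm, drop = FALSE],
       test  = test_x[, perm, drop = FALSE])
}

# Only runs once load_mnist() has created the global lists; may take a while
if (exists("train") && is.list(train) && exists("test") && is.list(test)) {
  idx <- 1:1000  # first 1000 test cases, as in the exercise
  err <- knn1_error(train$x, train$y, test$x[idx, ], test$y[idx])
  shuf <- permute_features(train$x, test$x[idx, ])
  err_shuffled <- knn1_error(shuf$train, train$y, shuf$test, test$y[idx])
  print(c(original = err, shuffled = err_shuffled))
}
```

Passing the permutation as an argument makes it easy to reuse one fixed permutation across several runs.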

The predictions can then be cross-tabulated with table(test$y, a), where test$y holds the actual labels and a holds the predicted ones.
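The misclassification rate can be read off such a table as one minus the share of counts on the diagonal. Below is a small sketch on made-up label vectors; the helper name confusion is hypothetical, and it forces both factors onto the same levels so that the table is square even if some digit is never predicted.

```r
# Hypothetical helper: square confusion table with actual labels in rows
# and predicted labels in columns
confusion <- function(actual, predicted) {
  lv <- sort(unique(c(as.character(actual), as.character(predicted))))
  table(actual    = factor(actual, levels = lv),
        predicted = factor(predicted, levels = lv))
}

# Made-up labels for illustration only
actual    <- c(0, 1, 1, 2, 2, 2)
predicted <- c(0, 1, 2, 2, 2, 1)

tab <- confusion(actual, predicted)
print(tab)

# Misclassification rate: share of cases off the diagonal
error_rate <- 1 - sum(diag(tab)) / sum(tab)
print(error_rate)
```

With the MNIST data, the same helper would be called as confusion(test$y, a).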

r

machine-learning

mnist