CSV >> Tensorflow >> regression (via neural network) model

TLDR; 1) read the CSV data and get it into Tensorflow, 2) build a regression model from the data. Note that back in 2016 I was very new to Python, deep learning, and Stack Overflow. Please vote to close this question; I feel it is too outdated.

Original question below...

Endless googling has left me better educated on Python and numpy, but still clueless about how to solve my task. I want to read a CSV of integer/floating point values and predict a value with a neural network. I have found several examples that read the Iris dataset and do classification, but I don't understand how to make them work for regression. Can someone help me connect the dots?

Here is one row of the input:

16804,0,1,0,1,1,0,1,0,1,0,1,0,0,1,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,1,0,0,1,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,1,0,0,1,0,1,0,1,0,1,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,1,0,0,1,0,1,0,1,0,1,0,1,0,1,1,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.490265,0.620805,0.54977,0.869299,0.422268,0.351223,0.33572,0.68308,0.40455,0.47779,0.307628,0.301921,0.318646,0.365993,6135.81

That should be 925 values. The last column is the output. The first is a RowID. Most values are binary because I have already done one-hot encoding. The test file does not have the output/last column. The full training file has about 10 million rows. A generic MxN solution would be fine.

Edit: let's use this sample data instead, since Iris is a classification problem, but note that the data above is my real target. I removed the ID column. Let's predict the last column from the other 6 columns. There are 45 rows. (Source: http://www.stat.ufl.edu/~winner/data/civwar2.dat)

100,1861,5,2,3,5,38
112,1863,11,7,4,59.82,15.18
113,1862,34,32,1,79.65,2.65
90,1862,5,2,3,68.89,5.56
93,1862,14,10,4,61.29,17.2
179,1862,22,19,3,62.01,8.89
99,1861,22,16,6,67.68,27.27
111,1862,16,11,4,78.38,8.11
107,1863,17,11,5,60.75,5.61
156,1862,32,30,2,60.9,12.82
152,1862,23,21,2,73.55,6.41
72,1863,7,3,3,54.17,20.83
134,1862,22,21,1,67.91,9.7
180,1862,23,16,4,69.44,3.89
143,1863,23,19,4,81.12,8.39
110,1862,16,12,2,31.82,9.09
157,1862,15,10,5,52.23,24.84
101,1863,4,1,3,58.42,18.81
115,1862,14,11,3,86.96,5.22
103,1862,7,6,1,70.87,0
90,1862,11,11,0,70,4.44
105,1862,20,17,3,80,4.76
104,1862,11,9,1,29.81,9.62
102,1862,17,10,7,49.02,6.86
112,1862,19,14,5,26.79,14.29
87,1862,6,3,3,8.05,72.41
92,1862,4,3,0,11.96,86.96
108,1862,12,7,3,16.67,25
86,1864,0,0,0,2.33,11.63
82,1864,4,3,1,81.71,8.54
76,1864,1,0,1,48.68,6.58
79,1864,0,0,0,15.19,21.52
85,1864,1,1,0,89.41,3.53
85,1864,1,1,0,56.47,0
85,1864,0,0,0,31.76,15.29
87,1864,6,5,0,81.61,3.45
85,1864,5,5,0,72.94,0
83,1864,0,0,0,46.99,2.38
101,1864,5,5,0,1.98,95.05
99,1864,6,6,0,42.42,9.09
10,1864,0,0,0,50,9
98,1864,6,6,0,79.59,3.06
10,1864,0,0,0,71,9
78,1864,5,5,0,70.51,1.28
89,1864,4,4,0,59.55,13.48
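For parsing, here is a small numpy sketch of one way to load that sample and split off the last column as the target (the file name is a placeholder; the same pattern works for the full MxN file with the column index changed):

import numpy as np

# Load the 45-row sample; each row has 7 comma-separated values
data = np.loadtxt("civwar2.csv", delimiter=",")

features = data[:, :-1]   # the first 6 columns
target = data[:, -1]      # the last column is the value to predict
print(features.shape, target.shape)   # (45, 6) (45,)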

Let me add that this is a common task, yet none of the forums I have read seem to answer it, which is why I'm asking. I could give you my broken code, but I don't want to waste your time on code that doesn't work correctly. Sorry for asking this way. I just don't understand the API, and the documentation doesn't tell me the data types.

Here is my latest code that reads the CSV into two ndarrays:

#!/usr/bin/env python
import numpy as np

# Build example data from a CSV file (my export of the data described above)
def buildDataFromCsv():
    # Load the whole CSV into a float ndarray, one row per example
    data = np.loadtxt(open("t100.csv.out", "rb"), delimiter=",", skiprows=0)
    # The last column (index 924) is the target value
    labels = data[:, 924]
    print("labels:", type(labels), labels.shape, labels.ndim)
    # Everything else is a feature
    data = np.delete(data, [924], axis=1)
    print("data:", type(data), data.shape, data.ndim)
    return data, labels

Here is the basic code I want to use. The example it comes from is incomplete too, and the API in the links below is vague. If I could at least figure out the data types to feed into DNNRegressor, and the other data types in the documentation, I might be able to write some custom code.

estimator = DNNRegressor(
    feature_columns=[education_emb, occupation_emb],
    hidden_units=[1024, 512, 256])

# Or estimator using the ProximalAdagradOptimizer optimizer with
# regularization.
estimator = DNNRegressor(
    feature_columns=[education_emb, occupation_emb],
    hidden_units=[1024, 512, 256],
    optimizer=tf.train.ProximalAdagradOptimizer(
      learning_rate=0.1,
      l1_regularization_strength=0.001
    ))

# Input builders
def input_fn_train():  # returns x, Y
  pass
estimator.fit(input_fn=input_fn_train)

def input_fn_eval():  # returns x, Y
  pass
estimator.evaluate(input_fn=input_fn_eval)
estimator.predict(x=x)
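For context on the data types: an input_fn here is just a function that builds and returns the feature tensor(s) and a label tensor. A rough sketch of one way to fill in the stub above, assuming train_X and train_y are numpy arrays already in memory and a single real-valued feature column named 'x' (rather than the embedding columns in the snippet):

def input_fn_train():
    # Features go in a dict keyed by feature-column name; labels are a tensor.
    # The estimator expects (features, labels) back from the input_fn.
    features = {'x': tf.constant(train_X, dtype=tf.float32)}
    labels = tf.constant(train_y, dtype=tf.float32)
    return features, labels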

Then the biggest question is how to make all of these work together.

Here are a few of the pages I have been looking at.

I have also found lower-level Tensorflow hard to understand in the past, and the documentation is not amazing. If you focus on getting the hang of sklearn, you should find skflow relatively easy to use. skflow sits at a much higher level than tensorflow, and its API is almost the same as sklearn's.

Now for the answer:

As a regression example, we will just perform regression on the iris dataset. That is a silly thing to do, but it demonstrates how to use DNNRegressor.

Skflow API

The first time you use a new API, try to use as few parameters as possible. You just want to get something working. So, I suggest you set up a DNNRegressor like this:

estimator = skflow.DNNRegressor(hidden_units=[16, 16])

I keep my number of hidden units small because I don't have much computing power right now.

Then you give it your training data, train_X, and training labels, train_y, and fit it as follows:

estimator.fit(train_X, train_y)

This is standard procedure for every sklearn classifier and regressor; skflow simply extends tensorflow to behave like sklearn. I also set the parameter steps=10 so that training finishes faster, running only 10 iterations.

Now if you want it to predict on some new data, test_X, you do that as follows:

pred = estimator.predict(test_X)

Again, this is standard procedure for all sklearn code. That's all there is to it - skflow is so simple you only need those three lines!
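Putting those three lines together (with the steps=10 mentioned above, and assuming train_X, train_y and test_X are numpy arrays you have already loaded):

# Build, fit and predict in three lines
estimator = skflow.DNNRegressor(hidden_units=[16, 16])
estimator.fit(train_X, train_y, steps=10)
pred = estimator.predict(test_X)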

What is the format of train_X and train_y?

In case you aren't too familiar with machine learning: your training data is generally an ndarray (matrix) of size M x d, where you have M training examples and d features. Your labels are M x 1 (an ndarray of shape (M,)).

So what you have looks something like this:

Features:   Sepal Width    Sepal Length ...               Labels
          [   5.1            2.5             ]         [0 (setosa)     ]
  X =     [   2.3            2.4             ]     y = [1 (virginica)  ]
          [   ...             ...            ]         [    ....       ]
          [   1.3            4.5             ]         [2 (Versicolour)]

(Note that I just made all of those numbers up.)

The test data will just be an N x d matrix with N test examples, and each test example needs to have d features. The predict function takes the test data and returns test labels of shape N x 1 (an ndarray of shape (N,)).
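As a concrete sketch of those shapes (the sizes here are made up):

import numpy as np

M, d, N = 100, 4, 25            # made-up sizes for illustration
train_X = np.random.rand(M, d)  # M x d training features
train_y = np.random.rand(M)     # training labels of shape (M,)
test_X = np.random.rand(N, d)   # N x d test features
print(train_X.shape, train_y.shape, test_X.shape)   # (100, 4) (100,) (25, 4)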

You didn't provide a .csv file, so I'll leave it to you to parse your data into that format. Conveniently, though, we can use sklearn.datasets.load_iris() to get the X and y we want. It's just:

iris = datasets.load_iris()
X = iris.data 
y = iris.target
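If you also want a held-out test set for the predict step, sklearn's train_test_split does the job (a quick sketch, reusing X and y from above):

from sklearn.model_selection import train_test_split

# Default split is 75% train / 25% test
train_X, test_X, train_y, test_y = train_test_split(X, y)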

Using the regressor as a classifier

The output of DNNRegressor will be real numbers (like 1.6789). But the iris dataset has the labels 0, 1 and 2 - the integer IDs for Setosa, Versicolour and Virginica. To perform classification with this regressor, we simply round to the nearest label (0, 1 or 2). For example, a prediction of 1.6789 rounds to 2.
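One way to do that rounding with numpy, assuming pred holds the continuous predictions from estimator.predict above (clipping keeps any stray prediction like 2.7 or -0.3 inside the valid label range):

import numpy as np

# Round each continuous prediction to the nearest class label 0, 1 or 2
class_preds = np.clip(np.rint(pred), 0, 2).astype(int)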

Working example

I find I learn the most from working examples, so here is a very simplified working example:
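A minimal sketch of what such an example can look like, putting the pieces above together (skflow's DNNRegressor on the iris data, with predictions rounded to the nearest class label):

import numpy as np
import tensorflow as tf
import tensorflow.contrib.learn as skflow
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load iris and hold out a test set
iris = datasets.load_iris()
train_X, test_X, train_y, test_y = train_test_split(iris.data, iris.target)

# One real-valued feature column per input feature, small network, few steps
feature_cols = tf.contrib.learn.infer_real_valued_columns_from_input(train_X)
estimator = skflow.DNNRegressor(feature_columns=feature_cols, hidden_units=[16, 16])
estimator.fit(train_X, train_y, steps=10)

# Continuous predictions, rounded to the nearest class label 0, 1 or 2
preds = np.array(list(estimator.predict(test_X)))
class_preds = np.clip(np.rint(preds), 0, 2).astype(int)
print("Test accuracy: %f" % np.mean(class_preds == test_y))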

Feel free to post a comment if you have any other questions.

In the end I had a couple of options that worked. Not sure why it was so hard to get up and running. First, here is code based on @user2570465's answer.

import tensorflow as tf
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
import tensorflow.contrib.learn as skflow

def buildDataFromIris():
    iris = datasets.load_iris()
    return iris.data, iris.target

X, y = buildDataFromIris()

# One real-valued feature column per input feature
feature_cols = tf.contrib.learn.infer_real_valued_columns_from_input(X)
estimator = skflow.DNNRegressor(feature_columns=feature_cols, hidden_units=[10, 10])

# Hold out a test set and fit only on the training portion
train_X, test_X, train_y, test_y = train_test_split(X, y)
estimator.fit(train_X, train_y, steps=10)

test_preds = estimator.predict(test_X)

def CalculateAccuracy(X, y):
    # Regression output is continuous; snap each prediction to the
    # closest class label (0, 1 or 2) before comparing with y
    continuous_predictions = estimator.predict(X)
    closest_class = []
    for pred in continuous_predictions:
        differences = np.array([abs(pred - 0), abs(pred - 1), abs(pred - 2)])
        closest_class.append(np.argmin(differences))

    num_correct = np.sum(np.array(closest_class) == y)
    accuracy = float(num_correct) / len(y)
    return accuracy

train_accuracy = CalculateAccuracy(train_X, train_y)
test_accuracy = CalculateAccuracy(test_X, test_y)

print("Train accuracy: %f" % train_accuracy)
print("Test accuracy: %f" % test_accuracy)

The other solution builds the model up from smaller components. Here is a snippet that computes Sig(X*W1 + b1)*W2 + b2 = Y. Optimizer = Adam, loss = L2, eval = L2 and MSE.

# data_X is the numpy feature matrix (shape [num_rows, n_input]) and
# data_Y the target column (shape [num_rows, 1]) parsed from the CSV above.
# The hyperparameters below are placeholders; tune them for your data.
n_input = data_X.shape[1]
n_output = 1
layer1_neurons = 10
numEpochs = 1000
batchSize = 32
train_size = int(0.8 * len(data_X))

x_train = data_X[:train_size]
y_train = data_Y[:train_size]
x_val = data_X[train_size:]
y_val = data_Y[train_size:]
print("x_train: {}".format(x_train.shape))

# build the model: Sig(X*W1 + b1)*W2 + b2 = Y
X = tf.placeholder(tf.float32, [None, n_input], name='X')
Y = tf.placeholder(tf.float32, [None, n_output], name='Y')

w_h = tf.Variable(tf.random_uniform([n_input, layer1_neurons], minval=-1, maxval=1, dtype=tf.float32))
b_h = tf.Variable(tf.zeros([1, layer1_neurons], dtype=tf.float32))
h = tf.nn.sigmoid(tf.matmul(X, w_h) + b_h)

w_o = tf.Variable(tf.random_uniform([layer1_neurons, 1], minval=-1, maxval=1, dtype=tf.float32))
b_o = tf.Variable(tf.zeros([1, 1], dtype=tf.float32))
model = tf.matmul(h, w_o) + b_o

# L2 loss for training; the same quantity, sum((model - Y)**2)/2, is reused for evaluation
train_op = tf.train.AdamOptimizer().minimize(tf.nn.l2_loss(model - Y))
output = tf.reduce_sum(tf.square(model - Y)) / 2

# launch the session
sess = tf.Session()
sess.run(tf.initialize_all_variables())

errors = []
for i in range(numEpochs):
    # mini-batch training over the training set
    for start, end in zip(range(0, len(x_train), batchSize), range(batchSize, len(x_train), batchSize)):
        sess.run(train_op, feed_dict={X: x_train[start:end], Y: y_train[start:end]})
    # L2 cost on the validation set
    cost = sess.run(output, feed_dict={X: x_val, Y: y_val})
    errors.append(cost)
    if i % 100 == 0:
        print("epoch %d, cost = %g" % (i, cost))
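The snippet trains and evaluates with the L2 cost; for the MSE part of the evaluation mentioned above, one small addition (reusing the same session, placeholders and validation arrays):

# Mean squared error on the validation set, alongside the L2 cost
mse = sess.run(tf.reduce_mean(tf.square(model - Y)), feed_dict={X: x_val, Y: y_val})
print("validation MSE = %g" % mse)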