CSV >> Tensorflow >> regression (via neural network) model
TL;DR: 1) read in CSV data and convert it to images, 2) create a regression model from the data. Note that in 2016 I was very new to python, deep learning, and Whosebug. Please vote to close this; I feel it is far too outdated.
Original question below...
Endless googling has left me better educated on Python and numpy, but still clueless about solving my task. I want to read a CSV of integer/floating point values and predict a value with a neural network. I have found several examples that read the Iris dataset and do classification, but I do not understand how to make them work for regression. Can someone help me connect the dots?
Here is one row of the input:
16804,0,1,0,1,1,0,1,0,1,0,1,0,0,1,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,1,0,0,1,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,1,0,0,1,0,1,0,1,0,1,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,1,0,0,1,0,1,0,1,0,1,0,1,0,1,1,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.490265,0.620805,0.54977,0.869299,0.422268,0.351223,0.33572,0.68308,0.40455,0.47779,0.307628,0.301921,0.318646,0.365993,6135.81
That should be 925 values. The last column is the output. The first is a RowID. Most are binary values because I have already done one-hot encoding. The test file does not have the output/last column. The full training file has roughly 10 million rows. A generic MxN solution would be fine.
Edit: let's use the sample data below, since Iris is a classification problem - but note that the data above is my real target. I removed the ID column. Let's predict the last column from the other 6 columns. There are 45 rows (source: http://www.stat.ufl.edu/~winner/data/civwar2.dat); a loading sketch follows the data.
100,1861,5,2,3,5,38
112,1863,11,7,4,59.82,15.18
113,1862,34,32,1,79.65,2.65
90,1862,5,2,3,68.89,5.56
93,1862,14,10,4,61.29,17.2
179,1862,22,19,3,62.01,8.89
99,1861,22,16,6,67.68,27.27
111,1862,16,11,4,78.38,8.11
107,1863,17,11,5,60.75,5.61
156,1862,32,30,2,60.9,12.82
152,1862,23,21,2,73.55,6.41
72,1863,7,3,3,54.17,20.83
134,1862,22,21,1,67.91,9.7
180,1862,23,16,4,69.44,3.89
143,1863,23,19,4,81.12,8.39
110,1862,16,12,2,31.82,9.09
157,1862,15,10,5,52.23,24.84
101,1863,4,1,3,58.42,18.81
115,1862,14,11,3,86.96,5.22
103,1862,7,6,1,70.87,0
90,1862,11,11,0,70,4.44
105,1862,20,17,3,80,4.76
104,1862,11,9,1,29.81,9.62
102,1862,17,10,7,49.02,6.86
112,1862,19,14,5,26.79,14.29
87,1862,6,3,3,8.05,72.41
92,1862,4,3,0,11.96,86.96
108,1862,12,7,3,16.67,25
86,1864,0,0,0,2.33,11.63
82,1864,4,3,1,81.71,8.54
76,1864,1,0,1,48.68,6.58
79,1864,0,0,0,15.19,21.52
85,1864,1,1,0,89.41,3.53
85,1864,1,1,0,56.47,0
85,1864,0,0,0,31.76,15.29
87,1864,6,5,0,81.61,3.45
85,1864,5,5,0,72.94,0
83,1864,0,0,0,46.99,2.38
101,1864,5,5,0,1.98,95.05
99,1864,6,6,0,42.42,9.09
10,1864,0,0,0,50,9
98,1864,6,6,0,79.59,3.06
10,1864,0,0,0,71,9
78,1864,5,5,0,70.51,1.28
89,1864,4,4,0,59.55,13.48
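For reference, a minimal loading sketch for this sample (the filename civwar2.csv is just a placeholder for a local copy of the 45 rows above):
import numpy as np

data = np.loadtxt("civwar2.csv", delimiter=",")   # the 45 rows above, saved locally
X = data[:, :6]    # first six columns are the inputs
y = data[:, 6]     # last column is the value to predict
print(X.shape, y.shape)   # (45, 6) (45,)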
Let me add that this is a common task, yet none of the forums I have read seem to answer it, hence this question. I could give you my broken code, but I do not want to waste your time on code that does not work correctly. Sorry for asking this way; I just do not understand the API, and the documentation does not tell me the data types.
Here is my latest code, which reads the CSV into two ndarrays:
#!/usr/bin/env python
import tensorflow as tf
import csv
import numpy as np
from numpy import genfromtxt
# Build example data in CSV format, but use the Iris data
from sklearn import datasets
from sklearn.model_selection import train_test_split
import sklearn

def buildDataFromIris():
    iris = datasets.load_iris()
    # Read the whole CSV, then split it into a feature matrix and a label vector.
    data = np.loadtxt(open("t100.csv.out", "rb"), delimiter=",", skiprows=0)
    labels = np.copy(data)
    labels = labels[:, 924]                    # last column is the target
    print("labels: ", type(labels), labels.shape, labels.ndim)
    data = np.delete(data, [924], axis=1)      # drop the target column from the features
    print("data: ", type(data), data.shape, data.ndim)
    return data, labels
Here is the basic code I want to use. The example it comes from is also incomplete, and the API in the links below is vague. If I could at least figure out the data type that goes into DNNRegressor, and the other data types in the documentation, I might be able to write some custom code.
estimator = DNNRegressor(
    feature_columns=[education_emb, occupation_emb],
    hidden_units=[1024, 512, 256])

# Or estimator using the ProximalAdagradOptimizer optimizer with
# regularization.
estimator = DNNRegressor(
    feature_columns=[education_emb, occupation_emb],
    hidden_units=[1024, 512, 256],
    optimizer=tf.train.ProximalAdagradOptimizer(
        learning_rate=0.1,
        l1_regularization_strength=0.001
    ))

# Input builders
def input_fn_train():  # returns x, Y
    pass
estimator.fit(input_fn=input_fn_train)

def input_fn_eval():  # returns x, Y
    pass
estimator.evaluate(input_fn=input_fn_eval)
estimator.predict(x=x)
Then the biggest question is how to make these pieces work together.
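For what it's worth, here is my rough, untested sketch of how I imagine those pieces fitting together with the tf.contrib.learn API; the real_valued_column feature column and the toy data are my own stand-ins, not taken from the docs:
import numpy as np
import tensorflow as tf
import tensorflow.contrib.learn as learn

# Toy stand-in data: 100 rows, 6 numeric features, one continuous target.
X = np.random.rand(100, 6).astype(np.float32)
y = (10 * X.sum(axis=1)).astype(np.float32)

feature_cols = [tf.contrib.layers.real_valued_column("x", dimension=6)]

def input_fn_train():
    # contrib.learn input_fns return (dict of feature tensors, label tensor)
    return {"x": tf.constant(X)}, tf.constant(y)

estimator = learn.DNNRegressor(feature_columns=feature_cols, hidden_units=[32, 16])
estimator.fit(input_fn=input_fn_train, steps=200)

# Predicting back on the training features just to show the call.
preds = list(estimator.predict(input_fn=input_fn_train))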
Here are a few of the pages I have been looking at:
- Basic code that reads a CSV and works (classifier):
https://www.tensorflow.org/versions/r0.11/tutorials/tflearn/index.html
- Regressor:
https://www.tensorflow.org/versions/r0.11/api_docs/python/contrib.learn.html#DNNRegressor
- CSV reading:
https://www.tensorflow.org/versions/master/how_tos/reading_data/index.html#csv-files
- Column embeddings:
https://www.tensorflow.org/versions/r0.11/tutorials/wide_and_deep/index.html
- API list (DNNRegressor, TensorFlowDNNRegressor, LinearRegressor, TensorFlowLinearRegressor, TensorFlowRNNRegressor, TensorFlowRegressor):
https://www.tensorflow.org/versions/r0.11/api_docs/python/contrib.learn.html
I have also found the lower-level Tensorflow hard to understand in the past, and the documentation is not amazing. If you focus on getting comfortable with sklearn, you should find skflow relatively easy to use: skflow sits much higher up than tensorflow, and its API is almost identical to sklearn's.
Now for the answer:
As a regression example, we will just run a regression on the iris dataset. That is a silly idea, but it serves to demonstrate how to use DNNRegressor.
Skflow API
The first time you use a new API, try to use as few parameters as possible. You just want to get something working. So I suggest you set up the DNNRegressor like this:
estimator = skflow.DNNRegressor(hidden_units=[16, 16])
I keep my number of hidden units small because I do not have much computing power right now.
Then you give it the training data, train_X, and the training labels, train_y, and fit it as follows:
estimator.fit(train_X, train_y)
This is the standard procedure for every sklearn classifier and regressor, and skflow simply extends tensorflow to behave like sklearn. I also set the parameter steps=10 so that training finishes faster, running only 10 iterations.
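With that argument spelled out, the fit call becomes:
estimator.fit(train_X, train_y, steps=10)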
Now, if you want it to predict on some new data, test_X, you do it as follows:
pred = estimator.predict(test_X)
Again, this is the standard procedure for all sklearn code. That's it - skflow is simple enough that you only need those three lines!
What format do train_X and train_y take?
If you are not too familiar with machine learning, your training data is generally an ndarray (a matrix) of size M x d, where you have M training examples and d features. Your labels are M x 1 (an ndarray of shape (M,)).
So what you have looks something like this:
Features:  Sepal Width  Sepal Length  ...            Labels
         [  5.1          2.5         ]        [ 0 (setosa)      ]
    X =  [  2.3          2.4         ]   y =  [ 1 (versicolour) ]
         [  ...          ...         ]        [ ...             ]
         [  1.3          4.5         ]        [ 2 (virginica)   ]
(Note that I just made all of those numbers up.)
The test data will simply be an N x d matrix with N test examples, each of which needs all d features. The predict function takes the test data and returns test labels of shape N x 1 (an ndarray of shape (N,)).
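As a quick illustration of those shapes with made-up sizes:
import numpy as np

M, d, N = 6, 4, 3
train_X = np.random.rand(M, d)   # M training examples, d features
train_y = np.random.rand(M)      # labels, shape (M,)
test_X = np.random.rand(N, d)    # N test examples, same d features
print(train_X.shape, train_y.shape, test_X.shape)   # (6, 4) (6,) (3, 4)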
You did not supply a .csv file, so I will leave parsing your data into that format to you. Conveniently, though, we can use sklearn.datasets.load_iris() to get the X and y we want. It is just:
iris = datasets.load_iris()
X = iris.data
y = iris.target
Using the regressor as a classifier
The output of DNNRegressor will be a bunch of real numbers (like 1.6789), but the iris dataset has labels 0, 1, and 2 - the integer IDs for Setosa, Versicolour, and Virginica. To do classification with this regressor, we simply round to the nearest label (0, 1, 2). For example, a prediction of 1.6789 rounds to 2.
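That rounding step is just a couple of lines of numpy, for example:
import numpy as np

preds = np.array([1.6789, 0.2, 0.9, 2.4])            # example regressor outputs
labels = np.clip(np.rint(preds), 0, 2).astype(int)   # round to the nearest label in {0, 1, 2}
print(labels)                                        # [2 0 1 2]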
Working example
I find that I learn the most from a working example, so here is a very simplified working example:
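A minimal sketch along those lines, assuming the TF 0.11-era contrib.learn API (imported as skflow) and sklearn's train_test_split:
import numpy as np
import tensorflow.contrib.learn as skflow
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
train_X, test_X, train_y, test_y = train_test_split(iris.data, iris.target)

feature_cols = skflow.infer_real_valued_columns_from_input(train_X)
estimator = skflow.DNNRegressor(feature_columns=feature_cols, hidden_units=[16, 16])
estimator.fit(train_X, train_y, steps=10)

# Round the continuous outputs to the nearest iris label (0, 1, 2) and score them.
raw_preds = np.asarray(list(estimator.predict(test_X))).ravel()
preds = np.clip(np.rint(raw_preds), 0, 2).astype(int)
print("accuracy:", np.mean(preds == test_y))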
Feel free to post a comment if you have any further questions.
In the end I had a couple of options. Not sure why it was so hard to get this up and running. First, here is code based on @user2570465's answer:
import tensorflow as tf
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
import tensorflow.contrib.learn as skflow

def buildDataFromIris():
    iris = datasets.load_iris()
    return iris.data, iris.target

X, y = buildDataFromIris()
feature_cols = tf.contrib.learn.infer_real_valued_columns_from_input(X)
estimator = skflow.DNNRegressor(feature_columns=feature_cols, hidden_units=[10, 10])
train_X, test_X, train_y, test_y = train_test_split(X, y)
estimator.fit(train_X, train_y, steps=10)
test_preds = estimator.predict(test_X)

def CalculateAccuracy(X, y):
    continuous_predictions = estimator.predict(X)
    closest_class = []
    for pred in continuous_predictions:
        # Distance from the prediction to each of the three class labels (0, 1, 2).
        differences = np.array([abs(pred - 0), abs(pred - 1), abs(pred - 2)])
        closest_class.append(np.argmin(differences))
    num_correct = np.sum(np.array(closest_class) == y)
    accuracy = float(num_correct) / len(y)
    return accuracy

train_accuracy = CalculateAccuracy(train_X, train_y)
test_accuracy = CalculateAccuracy(test_X, test_y)
print("Train accuracy: %f" % train_accuracy)
print("Test accuracy: %f" % test_accuracy)
The other solution builds the model from smaller components. Here is a snippet that computes sigmoid(X*W1 + b1)*W2 + b2 = Y, with Optimizer=Adam, loss=L2, and eval=L2 and MSE.
x_train = X[:train_size]
y_train = Y[:train_size]
x_val = X[train_size:]
y_val = Y[train_size:]
print("x_train: {}".format(x_train.shape))
# x_train = all_x[:train_size]               # leftover from an earlier experiment with all_x
# print("x_train: {}".format(x_train.shape))
# y_train = func(x_train)
# x_val = all_x[train_size:]
# y_val = func(x_val)
# plt.figure(1)
# plt.scatter(x_train, y_train, c='blue', label='train')
# plt.scatter(x_val, y_val, c='red', label='validation')
# plt.legend()
# plt.savefig("../img/nn_mlp1.png")
# Build the model: Y_hat = sigmoid(X*w_h + b_h) * w_o + b_o
X = tf.placeholder(tf.float32, [None, n_input], name = 'X')
Y = tf.placeholder(tf.float32, [None, n_output], name = 'Y')
w_h = tf.Variable(tf.random_uniform([n_input, layer1_neurons], minval=-1, maxval=1, dtype=tf.float32))
b_h = tf.Variable(tf.zeros([1, layer1_neurons], dtype=tf.float32))
h = tf.nn.sigmoid(tf.matmul(X, w_h) + b_h)
w_o = tf.Variable(tf.random_uniform([layer1_neurons, 1], minval=-1, maxval=1, dtype=tf.float32))
b_o = tf.Variable(tf.zeros([1, 1], dtype=tf.float32))
model = tf.matmul(h, w_o) + b_o
loss = tf.nn.l2_loss(model - Y)                     # L2 loss: sum((model - Y)**2) / 2
train_op = tf.train.AdamOptimizer().minimize(loss)
output = tf.reduce_sum(tf.square(model - Y)) / 2    # equivalent expression, kept for reference
#launch the session
sess = tf.Session()
sess.run(tf.global_variables_initializer())
val_cost = tf.nn.l2_loss(model - y_val)   # build the validation cost node once, outside the loop
errors = []
for i in range(numEpochs):
    for start, end in zip(range(0, len(x_train), batchSize), range(batchSize, len(x_train), batchSize)):
        sess.run(train_op, feed_dict={X: x_train[start:end], Y: y_train[start:end]})
    cost = sess.run(val_cost, feed_dict={X: x_val})
    errors.append(cost)
    if i % 100 == 0: print("epoch %d, cost = %g" % (i, cost))
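A possible follow-up for the MSE part, reusing the names from the snippet above (shapes assumed to match the placeholders):
# Mean squared error on the validation split, evaluated the same way as the L2 cost.
mse = tf.reduce_mean(tf.square(model - y_val))
print("validation MSE = %g" % sess.run(mse, feed_dict={X: x_val}))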