How to make a prediction from the last data point of the test set
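I am working on a time series forecasting project. My task is to predict December sales given data from January to November. I split the data into training and test sets and applied RandomForestRegressor to predict on the test set. However, I don't know how to use the model to predict December's sales. Can you tell me how to do this? Thank you in advance.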
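If you have already cleaned the data and split it into training and testing datasets, you can simply pass them to the pipeline function I created. This generic function takes any algorithm and the data as input, fits the model, performs cross-validation, and generates predictions for the testing dataset.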
import numpy as np
import pandas as pd
import cufflinks as cf
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

cf.go_offline()  # enables DataFrame.iplot() for offline plotting


def modelfitting(alg, training, testing, predictors, target):
    # Fit the algorithm on the training data
    alg.fit(training[predictors], training[target])

    # Predict on the training set
    dtrain_predictions = alg.predict(training[predictors])

    # Perform cross-validation (scores are negative MSE, so convert them to RMSE)
    cv_score = cross_val_score(alg, training[predictors], training[target],
                               cv=20, scoring='neg_mean_squared_error')
    cv_score = np.sqrt(np.abs(cv_score))

    # Print model report
    print("\nModel Report")
    print("RMSE : %.4g" % np.sqrt(mean_squared_error(training[target].values, dtrain_predictions)))
    print("CV Score : Mean - %.4g | Std - %.4g | Min - %.4g | Max - %.4g" % (
        np.mean(cv_score), np.std(cv_score), np.min(cv_score), np.max(cv_score)))

    # Predict on the testing data (on a copy, so the caller's DataFrame is left unchanged)
    testing = testing.copy()
    testing["sales prediction"] = alg.predict(testing[predictors])
    return testing


# training and testing are the DataFrames you have already cleaned and split
# Define target and ID columns
target = 'sales'
IDcol = ['months']
predictors = [x for x in training.columns if x not in [target] + IDcol]

alg = RandomForestRegressor(n_estimators=200, max_depth=5, min_samples_leaf=100, n_jobs=4)
test = modelfitting(alg, training, testing, predictors, target)

# Plot the feature importances
coef5 = pd.Series(alg.feature_importances_, predictors).sort_values(ascending=False)
coef5.iplot(kind='bar', title='Feature Importances')

# Plot the predictions for the testing set
for_plot = test[['sales prediction']]
for_plot.iplot()
I have added self-explanatory comments to the code. If you have trouble understanding it, feel free to discuss in the comments.
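For the December forecast itself, one possible approach is to refit the model on every month that has known sales and then predict on a row of December features. The sketch below is only an illustration under assumptions that are not in the question: the data is synthetic, the `store` column and the `sales_lag1` feature (previous month's sales) are hypothetical, and your real predictors may look quite different. The key point is that every December feature must be something you can know before December's sales are observed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data: one row per store per month, January (1) to November (11).
rng = np.random.default_rng(0)
stores = np.repeat(np.arange(20), 11)
months = np.tile(np.arange(1, 12), 20)
sales = 100 + 5 * months + 10 * stores + rng.normal(0, 3, size=months.size)
df = pd.DataFrame({"store": stores, "months": months, "sales": sales})

# Assumed feature: the previous month's sales for the same store.
df["sales_lag1"] = df.groupby("store")["sales"].shift(1)
history = df.dropna(subset=["sales_lag1"])  # January has no previous month

predictors = ["store", "months", "sales_lag1"]
target = "sales"

# Refit on all months with known sales (training and testing periods together).
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(history[predictors], history[target])

# Build the December rows: the lag feature is November's observed sales.
november = df[df["months"] == 11]
december = pd.DataFrame({
    "store": november["store"].to_numpy(),
    "months": 12,
    "sales_lag1": november["sales"].to_numpy(),
})
december["sales prediction"] = model.predict(december[predictors])
print(december.head())
```

Keep in mind that a random forest averages target values seen during training, so its predictions cannot exceed the range of past sales. If December is expected to be higher than any earlier month, the forecast will be capped at roughly the historical maximum.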