How should we interpret the results of the H2O predict function?
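I have trained and stored a random forest binary classification model. Now I am trying to simulate processing new (out-of-sample) data with this model. My Python (Anaconda 3.6) code is: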
import h2o
import pandas as pd
import sys
localH2O = h2o.init(ip = "localhost", port = 54321, max_mem_size = "8G", nthreads = -1)
h2o.remove_all()
model_path = "C:/sm/BottleRockets/rf_model/DRF_model_python_1501621766843_28117";
model = h2o.load_model(model_path)
new_data = h2o.import_file(path="C:/sm/BottleRockets/new_data.csv")
print(new_data.head(10))
predict = model.predict(new_data) # predict returns a data frame
print(predict.describe())
predicted = predict[0,0]
probability = predict[0,2] # probability the prediction is a "1"
print('prediction: ', predicted, ', probability: ', probability)
When I run this code, I get:
>>> import h2o
>>> import pandas as pd
>>> import sys
>>> localH2O = h2o.init(ip = "localhost", port = 54321, max_mem_size = "8G", nthreads = -1)
Checking whether there is an H2O instance running at http://localhost:54321. connected.
-------------------------- ------------------------------
H2O cluster uptime: 22 hours 22 mins
H2O cluster version: 3.10.5.4
H2O cluster version age: 18 days
H2O cluster name: H2O_from_python_Charles_0fqq0c
H2O cluster total nodes: 1
H2O cluster free memory: 6.790 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster status: locked, healthy
H2O connection url: http://localhost:54321
H2O connection proxy:
H2O internal security: False
Python version: 3.6.1 final
-------------------------- ------------------------------
>>> h2o.remove_all()
>>> model_path = "C:/sm/BottleRockets/rf_model/DRF_model_python_1501621766843_28117";
>>> model = h2o.load_model(model_path)
>>> new_data = h2o.import_file(path="C:/sm/BottleRockets/new_data.csv")
Parse progress: |█████████████████████████████████████████████████████████| 100%
>>> print(new_data.head(10))
BoxRatio Thrust Velocity OnBalRun vwapGain
---------- -------- ---------- ---------- ----------
1.502 55.044 0.38 37 0.845
[1 row x 5 columns]
>>> predict = model.predict(new_data) # predict returns a data frame
drf prediction progress: |████████████████████████████████████████████████| 100%
>>> print(predict.describe())
Rows:1
Cols:3
         predict    p0                  p1
-------  ---------  ------------------  -------------------
type     enum       real                real
mins                0.8849431818181818  0.11505681818181818
mean                0.8849431818181818  0.11505681818181818
maxs                0.8849431818181818  0.11505681818181818
sigma               0.0                 0.0
zeros               0                   0
missing  0          0                   0
0        1          0.8849431818181818  0.11505681818181818
None
>>> predicted = predict[0,0]
>>> probability = predict[0,2] # probability the prediction is a "1"
>>> print('prediction: ', predicted, ', probability: ', probability)
prediction: 1 , probability: 0.11505681818181818
>>>
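I am confused about the contents of the "predict" data frame. Please tell me what the numbers in the columns labeled "p0" and "p1" mean. I expect they are probabilities, and as you can see from my code, I am trying to get the predicted classification (0 or 1) and the probability that this classification is correct. Does my code do that correctly?
Any comments would be greatly appreciated.
Charles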
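p0 is the probability (between 0 and 1) that class 0 is chosen.
p1 is the probability (between 0 and 1) that class 1 is chosen.
The thing to keep in mind is that the "prediction" is made by applying a threshold to p1. Where that threshold sits depends on whether you want to reduce false positives or false negatives. It is not simply 0.5.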
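To make the thresholding step concrete, here is a minimal sketch in plain Python. The `classify` helper and the three threshold values are hypothetical, chosen only for illustration, and the `>=` comparison is an assumption about how the cutoff is applied. Only the p1 value comes from the output in the question.

```python
def classify(p1, threshold):
    """Return 1 if the class-1 probability reaches the threshold, else 0."""
    # Assumption: a row is labeled 1 when p1 >= threshold.
    return 1 if p1 >= threshold else 0


p1 = 0.11505681818181818  # p1 for the single row in the question

# Illustrative thresholds only; none of these is read from the stored model.
for threshold in (0.05, 0.10, 0.50):
    print("threshold:", threshold, "-> predicted class:", classify(p1, threshold))
```

With a cutoff of 0.5 this row would be labeled 0, but any cutoff at or below roughly 0.115 labels it 1, which matches the "predict" column in the output above.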
The threshold chosen for "the prediction" is max-F1. But you can extract p1 yourself and threshold it any way you like.
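As a rough illustration of what a max-F1 threshold means, the sketch below scans candidate cutoffs and keeps the one with the highest F1 score. It is not H2O's implementation: the helper names `f1_at` and `max_f1_threshold` are hypothetical, and the probabilities and labels are made-up toy data, not taken from the model in the question.

```python
def f1_at(threshold, probs, labels):
    """F1 score when rows with p1 >= threshold are labeled as class 1."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def max_f1_threshold(probs, labels):
    """Return the candidate threshold (taken from the observed p1 values) with the best F1."""
    candidates = sorted(set(probs))
    return max(candidates, key=lambda t: f1_at(t, probs, labels))


# Toy data (hypothetical): class-1 probabilities and the true labels.
probs = [0.05, 0.10, 0.12, 0.30, 0.60, 0.80]
labels = [0, 1, 1, 0, 1, 1]

best = max_f1_threshold(probs, labels)
print("max-F1 threshold:", best)
print("F1 at that threshold:", f1_at(best, probs, labels))
```

On this toy data the best cutoff is well below 0.5, so a row with p1 around 0.115 would still be labeled 1. To read the threshold of a real model rather than recompute it, h2o-py binomial models have, as far as I know, a `find_threshold_by_max_metric("f1")` method; check it against the installed version before relying on it.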
Darren Cook asked me to post the first few rows of the training data. Here they are:
   BoxRatio  Thrust  Velocity  OnBalRun  vwapGain  Altitude
0     0.000   0.000     2.186     4.534     0.361         1
1     0.000   0.000     0.561     2.642     0.909         1
2     2.824   2.824     2.199     4.748     1.422         1
3     0.442   0.452     1.702     3.695     1.186         0
4     0.084   0.088     0.612     1.699     0.700         1
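The response column is labeled "Altitude". Class 1 is what I want to see from new "out-of-sample" data. A "1" is good: it means "Altitude" was reached (true positive). A "0" means "Altitude" was not reached (true negative). In the prediction table above, the probability of the predicted "1" is 0.11505681818181818. That doesn't make sense to me.
Charles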