C50 code called exit with value 1 (using a factor decision variable and non-empty values)

Question

I have read a related post about a similar problem, but I am afraid this error code has a different cause. I have a CSV file with 8 observations and 10 variables:

 > str(rorIn)

'data.frame':   8 obs. of  10 variables:
 $ Acuity             : Factor w/ 3 levels "Elective  ","Emergency ",..: 1 1 2 2 1 2 2 3
 $ AgeInYears         : int  49 56 77 65 51 79 67 63
 $ IsPriority         : int  0 0 1 0 0 1 0 1
 $ AuthorizationStatus: Factor w/ 1 level "APPROVED  ": 1 1 1 1 1 1 1 1
 $ iscasemanagement   : Factor w/ 2 levels "N","Y": 1 1 2 1 1 2 2 2
 $ iseligible         : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1
 $ referralservicecode: Factor w/ 4 levels "12345","278",..: 4 1 3 1 1 2 3 1
 $ IsHighlight        : Factor w/ 1 level "N": 1 1 1 1 1 1 1 1
 $ RealLengthOfStay   : int  25 1 1 1 2 2 1 3
 $ Readmit            : Factor w/ 2 levels "0","1": 2 1 2 1 2 1 2 1

I call the algorithm like this:

library("C50")
rorIn <- read.csv(file = "RoRdataInputData_v1.6.csv", header = TRUE, quote = "\"")
rorIn$Readmit <- factor(rorIn$Readmit)
fit <- C5.0(Readmit~., data= rorIn)

Then I get:

> source("~/R-workspace/src/RoR/RoR/testing.R")
c50 code called exit with value 1
>

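I am already following the other recommendations, such as:
- using a factor as the decision variable
- avoiding empty data

Any help with this? I have read that this is one of the best algorithms for machine learning, but I get this error all the time.

Here is the original dataset: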

Acuity,AgeInYears,IsPriority,AuthorizationStatus,iscasemanagement,iseligible,referralservicecode,IsHighlight,RealLengthOfStay,Readmit
Elective  ,49,0,APPROVED  ,N,Y,SNF            ,N,25,1
Elective  ,56,0,APPROVED  ,N,Y,12345,N,1,0
Emergency ,77,1,APPROVED  ,Y,Y,OBSERVE        ,N,1,1
Emergency ,65,0,APPROVED  ,N,Y,12345,N,1,0
Elective  ,51,0,APPROVED  ,N,Y,12345,N,2,1
Emergency ,79,1,APPROVED  ,Y,Y,278,N,2,0
Emergency ,67,0,APPROVED  ,Y,Y,OBSERVE        ,N,1,1
Urgent    ,63,1,APPROVED  ,Y,Y,12345,N,3,0

Thanks in advance for your help,

David

Answer 1

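You need to clean your data in a few ways.

Remove the unnecessary columns that have only one level. They contain no information and cause problems.
Convert the class of the target variable rorIn$Readmit to a factor.
Separate the target variable from the data set that you supply for training.

This should work: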

rorIn <- read.csv("RoRdataInputData_v1.6.csv", header=TRUE) 
rorIn$Readmit <- as.factor(rorIn$Readmit)
library(Hmisc)
singleLevelVars <- names(rorIn)[contents(rorIn)$contents$Levels == 1]
trainvars <- setdiff(colnames(rorIn), c("Readmit", singleLevelVars))
library(C50)
RoRmodel <- C5.0(rorIn[,trainvars], rorIn$Readmit,trials = 10)
predict(RoRmodel, rorIn[,trainvars])
#[1] 1 0 1 0 0 0 1 0
#Levels: 0 1
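
As an aside on the first cleaning step: the single-level columns can also be found without Hmisc. The following is a minimal base R sketch, not the answer's method. It rebuilds the question's data inline (the `csv_text` name is only for illustration) and flags any column with a single distinct value, which also covers constant numeric columns and not only single-level factors.

```r
# Rebuild the question's data inline so the sketch runs without the CSV file.
# stringsAsFactors = TRUE reproduces the factor columns shown by str(rorIn).
csv_text <- "Acuity,AgeInYears,IsPriority,AuthorizationStatus,iscasemanagement,iseligible,referralservicecode,IsHighlight,RealLengthOfStay,Readmit
Elective  ,49,0,APPROVED  ,N,Y,SNF            ,N,25,1
Elective  ,56,0,APPROVED  ,N,Y,12345,N,1,0
Emergency ,77,1,APPROVED  ,Y,Y,OBSERVE        ,N,1,1
Emergency ,65,0,APPROVED  ,N,Y,12345,N,1,0
Elective  ,51,0,APPROVED  ,N,Y,12345,N,2,1
Emergency ,79,1,APPROVED  ,Y,Y,278,N,2,0
Emergency ,67,0,APPROVED  ,Y,Y,OBSERVE        ,N,1,1
Urgent    ,63,1,APPROVED  ,Y,Y,12345,N,3,0
"
rorIn <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)
rorIn$Readmit <- as.factor(rorIn$Readmit)

# A column carries no information when it holds one distinct value.
isConstant <- vapply(rorIn, function(col) length(unique(col)) == 1, logical(1))
singleLevelVars <- names(rorIn)[isConstant]

# Predictors: everything except the target and the constant columns.
trainvars <- setdiff(colnames(rorIn), c("Readmit", singleLevelVars))

print(singleLevelVars)
print(trainvars)
```

On this data the flagged columns should be AuthorizationStatus, iseligible, and IsHighlight, which matches the single-level factors visible in the str(rorIn) output in the question.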

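You can then evaluate accuracy, recall, and other statistics by comparing the predictions returned by predict() with the actual values of the target variable: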

rorIn$Readmit
#[1] 1 0 1 0 1 0 1 0
#Levels: 0 1

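The usual approach is to build a confusion matrix that compares actual and predicted values in a binary classification problem. With a data set this small it is easy to see that there is only one false negative. So the code seems to work fine, but because the number of observations is very small, this encouraging result may be deceptive.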

library(gmodels)
actual <- rorIn$Readmit
predicted <- predict(RoRmodel,rorIn[,trainvars])
CrossTable(actual,predicted, prop.chisq=FALSE,prop.r=FALSE)
# Total Observations in Table:  8
#
#
#              | predicted
#       actual |         0 |         1 | Row Total |
#--------------|-----------|-----------|-----------|
#            0 |         4 |         0 |         4 |
#              |     0.800 |     0.000 |           |
#              |     0.500 |     0.000 |           |
#--------------|-----------|-----------|-----------|
#            1 |         1 |         3 |         4 |
#              |     0.200 |     1.000 |           |
#              |     0.125 |     0.375 |           |
#--------------|-----------|-----------|-----------|
# Column Total |         5 |         3 |         8 |
#              |     0.625 |     0.375 |           |
#--------------|-----------|-----------|-----------|
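
The accuracy and recall mentioned earlier can also be computed directly in base R, without gmodels. This is a minimal sketch: the two vectors are copied from the outputs printed in this answer, and the variable names (`confusion`, `truePos`, and so on) are illustrative only. Class "1" is treated as the positive class, which is an assumption.

```r
# Actual and predicted labels, copied from the outputs shown in this answer.
actual    <- factor(c(1, 0, 1, 0, 1, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(1, 0, 1, 0, 0, 0, 1, 0), levels = c(0, 1))

# Rows are actual classes, columns are predicted classes.
confusion <- table(actual = actual, predicted = predicted)
print(confusion)

# Treat "1" (readmitted) as the positive class (an assumption).
truePos  <- confusion["1", "1"]
falseNeg <- confusion["1", "0"]
falsePos <- confusion["0", "1"]

accuracy  <- sum(diag(confusion)) / sum(confusion)  # 7 of 8 correct
recall    <- truePos / (truePos + falseNeg)         # 3 of 4 positives found
precision <- truePos / (truePos + falsePos)         # no false positives

print(c(accuracy = accuracy, recall = recall, precision = precision))
```

These figures are measured on the same eight rows the model was trained on, so they say little about performance on unseen data.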

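On a larger data set it would be useful, if not essential, to split the set into training data and test data. There is a lot of good literature on machine learning that can help you fine-tune the model and its predictions.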
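
A random train/test split might look like the sketch below. It uses R's built-in iris data purely as a stand-in for a larger data set, and both the 70/30 ratio and the seed are arbitrary assumptions. The model would then be fitted on `trainSet` and evaluated on `testSet`.

```r
set.seed(42)  # arbitrary seed, only so the split is reproducible

dat <- iris   # stand-in for a larger data set (not the question's data)

# Draw 70% of the row indices at random for training (ratio is an assumption).
trainIdx <- sample(nrow(dat), size = round(0.7 * nrow(dat)))

trainSet <- dat[trainIdx, ]   # rows used to fit the model
testSet  <- dat[-trainIdx, ]  # held-out rows used only for evaluation

print(c(train = nrow(trainSet), test = nrow(testSet)))
```

With the data split this way, the C5.0() and predict() calls shown earlier would take the training predictors and the test predictors respectively, so the confusion matrix reflects rows the model has not seen.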
