Text2Vec classification with caret - Naive Bayes warning message

Please see the linked post for more context.

I am trying to train a Naive Bayes (nb) model with the caret package, using a document-term matrix built with text2vec. However, I get this warning message:

    Warning message:
    In eval(xpr, envir = envir) :
      model fit failed for Fold01.Rep1: usekernel=FALSE, fL=0, adjust=1 Error in NaiveBayes.default(x, y, usekernel = FALSE, fL = param$fL, ...) :
      Zero variances for at least one class in variables:

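Please help me understand this message and what steps I need to take to keep the model fit from failing. I suspect I need to remove more sparse terms from the DTM, but I am not sure.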

Code for building the model:

    library(caret)

    control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                            savePredictions = TRUE, classProbs = TRUE)

    # classProbs = TRUE needs class labels that are valid R names
    Train_PRDHA_String.df$Result <- ifelse(Train_PRDHA_String.df$Result == 1, "X", "Y")

    # Print warnings as they occur
    options(warn = 1)

    t4 = Sys.time()
    svm_nb <- train(x = as.matrix(dtm_train), y = as.factor(Train_PRDHA_String.df$Result),
                    method = "nb",
                    trControl = control,
                    tuneLength = 5,
                    metric = "Accuracy")
    print(difftime(Sys.time(), t4, units = 'sec'))

Code for building the document-term matrix (text2vec):

    library(text2vec)
    library(data.table)

    # Define preprocessing function and tokenization function
    preproc_func = tolower
    token_func = word_tokenizer

    # Union both of the text fields - learn vocab from both fields
    union_txt = c(Train_PRDHA_String.df$MAKTX_Keyword, Train_PRDHA_String.df$PH_Level_04_Description_Keyword)

    # Create an iterator over tokens with the itoken() function
    it_train = itoken(union_txt, 
                      preprocessor = preproc_func, 
                      tokenizer = token_func, 
                      ids = Train_PRDHA_String.df$ID, 
                      progressbar = TRUE)

    # Build vocabulary
    vocab = create_vocabulary(it_train)

    vocab

    # Dimensionality reduction: drop very rare and very common terms
    pruned_vocab = prune_vocabulary(vocab, 
                                    term_count_min = 10, 
                                    doc_proportion_max = 0.5,
                                    doc_proportion_min = 0.001)
    vectorizer = vocab_vectorizer(pruned_vocab)

    # Start building a document-term matrix
    #vectorizer = vocab_vectorizer(vocab)

    # Build a DTM for Train_PRDHA_String.df$MAKTX_Keyword using the shared vectorizer
    it1 = itoken(Train_PRDHA_String.df$MAKTX_Keyword, preproc_func, 
                 token_func, ids = Train_PRDHA_String.df$ID)
    dtm_train_1 = create_dtm(it1, vectorizer)

    # Build a DTM for Train_PRDHA_String.df$PH_Level_04_Description_Keyword using the shared vectorizer
    it2 = itoken(Train_PRDHA_String.df$PH_Level_04_Description_Keyword, preproc_func, 
                 token_func, ids = Train_PRDHA_String.df$ID)
    dtm_train_2 = create_dtm(it2, vectorizer)

    # Combine dtm1 & dtm2 into a single matrix
    dtm_train = cbind(dtm_train_1, dtm_train_2)

    # Normalise each row to unit L1 norm
    dtm_train = normalize(dtm_train, "l1")

    dim(dtm_train)

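That means that when those variables are resampled, they have only one unique value. You can use `preProc = "zv"` to get rid of the warning. It would help to have a small, reproducible example for these problems.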
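
To make the "only one unique value" point concrete, here is a minimal base-R sketch on made-up data. The names `dtm_toy`, `y_toy` and `in_fold` are hypothetical and not taken from the question. It shows how a sparse term can become constant once a resample leaves out the few documents that contain it, and it applies the same kind of filter by hand. It only illustrates the idea and is not caret's internal implementation.

```r
# Toy document-term matrix: 6 documents, 3 terms (hypothetical data).
# The term "rare" occurs in a single document, mimicking a sparse DTM column.
dtm_toy <- matrix(
  c(1, 0, 0,
    2, 1, 0,
    0, 1, 0,
    1, 2, 0,
    0, 1, 1,
    3, 0, 0),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("common", "medium", "rare"))
)
y_toy <- factor(c("X", "X", "X", "Y", "Y", "Y"))

# A resample that happens to leave out document 5, the only one containing "rare".
in_fold <- c(1, 2, 3, 4, 6)
fold_x <- dtm_toy[in_fold, , drop = FALSE]
fold_y <- y_toy[in_fold]

# Count unique values per column within the resample;
# a column with a single unique value has zero variance.
n_unique <- apply(fold_x, 2, function(col) length(unique(col)))
zero_var <- names(n_unique)[n_unique == 1]
print(zero_var)

# Per-class variances, matching the wording of the error message
# ("Zero variances for at least one class").
class_var <- sapply(split(as.data.frame(fold_x), fold_y),
                    function(d) sapply(d, var))
print(class_var)

# Drop the constant columns by hand before fitting.
fold_x_kept <- fold_x[, setdiff(colnames(fold_x), zero_var), drop = FALSE]
print(colnames(fold_x_kept))
```

In the question's code, the equivalent step would be adding `preProc = "zv"` to the existing `train()` call, so that the filter is applied within each resample. Note that the sketch checks for columns that are constant across the whole resample; a column that is constant only within one class is a separate case that the per-class table above helps to spot.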