Getting repeated terms after Latent Dirichlet allocation
I am trying the following for a Latent Dirichlet allocation implementation, but I am getting repeated terms. How can I get unique terms from LDA?
library(tm)
myCorpus <- Corpus(VectorSource(tweets$text))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)           # drop URLs (http plus the rest of the token)
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]", "", x)  # keep letters and spaces only
myCorpus <- tm_map(myCorpus, content_transformer(removeNumPunct))
myStopwords <- c(stopwords('english'), "available", "via")
myStopwords <- setdiff(myStopwords, c("r", "big"))
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
myCorpus <- tm_map(myCorpus, stripWhitespace)
myCorpusCopy <- myCorpus
library(SnowballC)                            # stemmer used by stemDocument
myCorpus <- tm_map(myCorpus, stemDocument)
dtm <- DocumentTermMatrix(myCorpus)
library("RTextTools", lib.loc="~/R/win-library/3.2")
library("topicmodels", lib.loc="~/R/win-library/3.2")
om1 <- LDA(dtm, 30)
terms(om1)
According to https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation, in LDA each document is treated as a mixture of various topics. That is, for each document (tweet) we get the probability of that tweet belonging to each topic, and these probabilities sum to 1.
Similarly, each topic is treated as a mixture of various terms (words). That is, for each topic we get the probability of each word belonging to that topic, and these probabilities sum to 1.
So every word-topic combination has a probability assigned to it. The code terms(om1) fetches the word with the highest probability in each topic.
In your case, the same word simply has the highest probability in more than one topic. This is not an error.
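You can check both of these claims directly on the fitted model. Below is a minimal sketch; it assumes the model is stored in om1, as above, and uses the posterior() accessor from topicmodels:

library(topicmodels)

post <- posterior(om1)   # list with $topics (document-topic) and $terms (topic-term) probabilities

rowSums(post$topics)     # one value per document, each equal to 1
rowSums(post$terms)      # one value per topic, each equal to 1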
The code below creates a TopicTermdf data set that contains the distribution of all the words for each topic. Looking at that data set will help you understand this better.
The code below is based on the post LDA with topicmodels, how can I see which topics different documents belong to?
Code:
# Reproducible data - From the Coursera.org Johns Hopkins Data Science Specialization Capstone project, SwiftKey Challenge dataset
tweets <- c("How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.",
"When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason.",
"they've decided its more fun if I don't.",
"So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)",
"Words from a complete stranger! Made my birthday even better :)",
"First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!",
"i no! i get another day off from skool due to the wonderful snow (: and THIS wakes me up...damn thing",
"I'm coo... Jus at work hella tired r u ever in cali",
"The new sundrop commercial ...hehe love at first sight",
"we need to reconnect THIS WEEK")
library(tm)
myCorpus <- Corpus(VectorSource(tweets))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)           # drop URLs (http plus the rest of the token)
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]", "", x)  # keep letters and spaces only
myCorpus <- tm_map(myCorpus, content_transformer(removeNumPunct))
myStopwords <- c(stopwords('english'), "available", "via")
myStopwords <- setdiff(myStopwords, c("r", "big"))
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
myCorpus <- tm_map(myCorpus, stripWhitespace)
myCorpusCopy <- myCorpus
library(SnowballC)                            # stemmer used by stemDocument
myCorpus <- tm_map(myCorpus, stemDocument)
dtm <- DocumentTermMatrix(myCorpus)
library(RTextTools)
library(topicmodels)
om1 <- LDA(dtm, 3)
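The code above only fits the model; it does not build the TopicTermdf data set referenced earlier and used in the output below. One way to construct it (an assumed construction on my part, using posterior() from topicmodels; the original post does not show this step) is:

# Hypothetical construction of TopicTermdf: one row per topic, one column per
# term, entries are P(term | topic)
TopicTermdf <- as.data.frame(posterior(om1)$terms)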
Output:
> # Get the top word for each topic
> terms(om1)
Topic 1 Topic 2 Topic 3
"youll" "cub" "anoth"
>
> #Top word for each topic
> colnames(TopicTermdf)[apply(TopicTermdf,1,which.max)]
[1] "youll" "cub" "anoth"
>
Try to find the optimal number of topics. To do this, build several LDA models with different numbers of topics and choose the one with the highest coherence score.
If you see the same keywords (terms) repeated across multiple topics, it is probably a sign that the value of k (the number of topics) is too large. Although it is written in Python, this link to LDA topic modeling describes a grid-search approach to finding the optimal value (i.e., deciding how many topics to use).
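Here is a minimal sketch of that idea in R. It is an assumption on my part that perplexity() from topicmodels is used as the selection criterion (a coherence score is not built into that package; lower perplexity is better), and the candidate values of k are just examples:

library(topicmodels)

ks <- 2:6                                                          # candidate numbers of topics
models <- lapply(ks, function(k) LDA(dtm, k = k, control = list(seed = 1234)))

# Training-data perplexity of each fit; ideally compute this on held-out documents instead.
perp <- sapply(models, perplexity)

data.frame(k = ks, perplexity = perp)
ks[which.min(perp)]                                                # candidate for the best k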