Finding the dominant topic in each sentence in topic modeling

Question

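One question I haven't been able to answer in R is how to find the dominant topic of an NLP model for each sentence. Imagine I have a data frame like this: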

comment <- c("outstanding renovation all improvements are topoftheline and done with energy efficiency in mind low monthly utilities even the interior",
             "solidly constructed lovingly maintained sf crest built",
             "one year since built new this well designed storey home",
             "beautiful street large bdm in the heart of lynn valley over sqft bathrooms",
             "rare to find legal beautiful upgr in port moody centre with a mountain view all bedroom units were nicely renovated",
             "fantastic opportunity to get value for the money excellent family home in desirable blueridge with legal selfcontained bachelor suite on the main floor great location close to swimming ice skating community",
             "original owner tired but rock solid perfect location half a block to norquay elementary school and short quiet blocks to slocan park and sky train station")

id <- c(1,2,3,4,5,6,7)

data <- data.frame(id, comment)

I preprocess it as shown below:

library(dplyr)      # %>%, filter, mutate, anti_join, group_by
library(tidytext)   # unnest_tokens, stop_words
library(SnowballC)  # wordStem

# one row per token
text_cleaning_tokens <- data %>% 
  tidytext::unnest_tokens(word, comment)
# strip digits and punctuation
text_cleaning_tokens$word <- gsub('[[:digit:]]+', '', text_cleaning_tokens$word)
text_cleaning_tokens$word <- gsub('[[:punct:]]+', '', text_cleaning_tokens$word)

# drop one-character tokens and stop words
text_cleaning_tokens <- text_cleaning_tokens %>% filter(!(nchar(word) == 1)) %>% 
  anti_join(stop_words, by = "word")

stemmed_token <- text_cleaning_tokens %>% mutate(word = wordStem(word))

# rebuild one cleaned string per id
tokens <- stemmed_token %>% filter(!(word == ""))
tokens <- tokens %>% group_by(id) %>% mutate(ind = row_number()) %>%
  tidyr::spread(key = ind, value = word)
tokens[is.na(tokens)] <- ""
tokens <- tidyr::unite(tokens, clean_remark, -id, sep = " ")
tokens$clean_remark <- trimws(tokens$clean_remark)

I ran the FitLdaModel function on this data and ended up with the top terms for 2 topics:

             t_1            t_2
1         beauti          built
2          block           home
3          renov          legal
4       bathroom          locat
5            bdm       bachelor
6      bdm_heart  bachelor_suit
7  beauti_street  block_norquai
8    beauti_upgr       blueridg
9        bedroom blueridg_legal
10  bedroom_unit   built_design

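Based on this result, I now want to find the most dominant topic for each sentence. For example, for comment 1 ("outstanding renovation all improvements are topoftheline and done with energy efficiency in mind low monthly utilities even the interior"), which topic (topic 1 or topic 2) is the most dominant?

Can anyone help me with this? Is there a package that can do it?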

Answer 1

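This is pretty easy with quanteda and topicmodels. The former handles data management and quantitative analysis of text data; the latter does the topic-model inference.

Here I convert your comment object into a corpus and then into a dfm. I then convert the dfm into a format that topicmodels understands.

The LDA() function gives you everything you need to extract the information easily. In particular, get_topics() returns the most likely topic for each document. If you want to see the document-topic weights instead, look at the gamma slot of the fitted model (myLDA@gamma in the code here). You'll see that get_topics() does exactly what you asked for.

See whether this works for you.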

library(quanteda)
#> Package version: 2.1.2
#> Parallel computing: 2 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
#> 
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#> 
#>     View
library(topicmodels)


comment <- c("outstanding renovation all improvements are topoftheline and done with energy efficiency in mind low monthly utilities even the interior",
             "solidly constructed lovingly maintained sf crest built",
             "one year since built new this well designed storey home",
             "beautiful street large bdm in the heart of lynn valley over sqft bathrooms",
             "rare to find legal beautiful upgr in port moody centre with a mountain view all bedroom units were nicely renovated",
             "fantastic opportunity to get value for the money excellent family home in desirable blueridge with legal selfcontained bachelor suite on the main floor great location close to swimming ice skating community",
             "original owner tired but rock solid perfect location half a block to norquay elementary school and short quiet blocks to slocan park and sky train station")

mycorp <- corpus(comment)
docvars(mycorp, "id") <- 1L:7L

# tokenize explicitly; passing a corpus straight to dfm() is deprecated since quanteda 3.0
mydfm <- dfm(tokens(mycorp))

# convert the DFM to a Document Matrix for topicmodels
forTM <- convert(mydfm, to = "topicmodels")

myLDA <- LDA(forTM, k = 2)

dominant_topics <- get_topics(myLDA)
dominant_topics
#> text1 text2 text3 text4 text5 text6 text7 
#>     2     2     2     2     1     1     1

dtw <- myLDA@gamma
dtw
#>           [,1]      [,2]
#> [1,] 0.4870600 0.5129400
#> [2,] 0.4994974 0.5005026
#> [3,] 0.4980144 0.5019856
#> [4,] 0.4938985 0.5061015
#> [5,] 0.5037667 0.4962333
#> [6,] 0.5000727 0.4999273
#> [7,] 0.5176960 0.4823040

Created on 2021-03-18 by the reprex package (v1.0.0)
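
To make the link between the two outputs explicit, here is a minimal base-R sketch. It assumes nothing beyond the gamma matrix printed above, which is copied in by hand so the snippet runs without quanteda or topicmodels; the names dominant, margin and result are mine. It takes the column with the largest weight in each row, which reproduces the get_topics() result, and also reports how far apart the two weights are.

```r
# Document-topic weights copied from the myLDA@gamma output above
# (rows = documents text1..text7, columns = topics 1 and 2).
gamma <- matrix(
  c(0.4870600, 0.5129400,
    0.4994974, 0.5005026,
    0.4980144, 0.5019856,
    0.4938985, 0.5061015,
    0.5037667, 0.4962333,
    0.5000727, 0.4999273,
    0.5176960, 0.4823040),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("text", 1:7), c("topic1", "topic2"))
)

# Dominant topic = index of the largest weight in each row.
dominant <- apply(gamma, 1, which.max)

# Gap between the two weights; a small gap means a weak assignment.
margin <- abs(gamma[, "topic1"] - gamma[, "topic2"])

result <- data.frame(dominant = dominant, margin = round(margin, 4))
print(result)
```

In this particular run every gap is below 0.04, so the "dominant" topic is only marginally ahead in each document. With seven short texts that is not surprising, and it is worth checking the weights before reading much into the labels.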

Answer 2

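I agree with the other answer that quanteda and topicmodels are the better choice. You might also look at seededlda, an LDA implementation by one of the quanteda authors (with extra features you don't have to use).

However, if you want to stick with tidytext and textmineR, here is how to do it.

First, I simplified your preprocessing a bit, since some of the steps seemed unnecessary to me: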

library(tidyverse)
library(tidytext)

text_cleaning_tokens <- data %>% 
  unnest_tokens(word, comment) %>% 
  # str_remove_all() strips every digit/punctuation character, not just the first match
  mutate(word = str_remove_all(word, "[[:digit:]]|[[:punct:]]")) %>% 
  filter(!(nchar(word) <= 1)) %>% 
  anti_join(stop_words, by = "word") %>% 
  mutate(word = SnowballC::wordStem(word))

Then I run LDA following the textmineR example:

lda <- text_cleaning_tokens %>% 
  cast_sparse(id, word) %>% 
  textmineR::FitLdaModel(k = 2,
                         iterations = 200,
                         burnin = 175,
                         optimize_alpha = TRUE,
                         calc_likelihood = TRUE,
                         calc_r2 = TRUE)

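Every LDA implementation gives you two important results:

phi (φ) shows how each word in the corpus scores on each topic. The higher the phi value, the more prevalent the word is in that topic.
theta (θ) shows how each document in the corpus scores on each topic. The higher the theta value, the more prevalent the topic is in that document. (topicmodels calls this gamma for some reason.)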
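
As a rough illustration of how the two matrices are laid out in textmineR (phi has one row per topic, theta one row per document), here is a toy sketch. The numbers are hypothetical and picked by hand, not taken from any fitted model; only the shapes matter.

```r
# Hypothetical toy values, chosen only to illustrate the shapes.
# phi: one row per topic, one column per word; each row sums to 1.
phi <- matrix(
  c(0.5, 0.3, 0.1, 0.1,
    0.1, 0.1, 0.5, 0.3),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("t_1", "t_2"), c("renov", "beauti", "built", "home"))
)

# theta: one row per document, one column per topic; each row sums to 1.
theta <- matrix(
  c(0.9, 0.1,
    0.2, 0.8),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("t_1", "t_2"))
)

# Most prevalent word in each topic (read along the rows of phi).
top_word <- colnames(phi)[apply(phi, 1, which.max)]

# Most prevalent topic in each document (read along the rows of theta).
top_topic <- colnames(theta)[apply(theta, 1, which.max)]

print(top_word)
print(top_topic)
```

So phi answers "what is this topic about?" and theta answers "what is this document about?", and the second question is the one being asked here.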

In other words, to find the most dominant topic in each text, all you need to do is:

lda$theta %>% 
  as_tibble() %>% 
  rowwise() %>% 
  mutate(top = which.max(c_across(everything()))) %>% # find highest value per row dplyr style
  bind_cols(data, .) %>% # bind to original data
  as_tibble() # just for nicer printing
#> # A tibble: 7 x 5
#>      id comment                                                t_1     t_2   top
#>   <int> <chr>                                                <dbl>   <dbl> <int>
#> 1     1 1 . outstanding renovation all improvements are t… 0.892   0.108       1
#> 2     2 solidly constructed lovingly maintained sf crest … 0.0161  0.984       2
#> 3     3 one year since built new this well designed store… 0.0238  0.976       2
#> 4     4 beautiful street large bdm in the heart of lynn v… 0.986   0.0139      1
#> 5     5 rare to find legal beautiful upgr in port moody c… 0.992   0.00820     1
#> 6     6 fantastic opportunity to get value for the money … 0.266   0.734       2
#> 7     7 original owner tired but rock solid perfect locat… 0.00549 0.995       2

Created on 2021-03-18 by the reprex package (v1.0.0)
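
If you would rather avoid rowwise(), the same lookup can be sketched in base R with max.col(). To keep the snippet self-contained, the theta values below are copied by hand from the rounded tibble printed above instead of being read from lda$theta; with a real fit you would pass lda$theta directly.

```r
# theta values copied (rounded) from the printed tibble above;
# rows are documents 1..7, columns are topics t_1 and t_2.
theta <- matrix(
  c(0.892,   0.108,
    0.0161,  0.984,
    0.0238,  0.976,
    0.986,   0.0139,
    0.992,   0.00820,
    0.266,   0.734,
    0.00549, 0.995),
  ncol = 2, byrow = TRUE,
  dimnames = list(as.character(1:7), c("t_1", "t_2"))
)

# Column index of the row-wise maximum; ties go to the first column.
top <- max.col(theta, ties.method = "first")

print(data.frame(id = 1:7, theta, top = top))
```

Setting ties.method = "first" matters because max.col() breaks ties at random by default, whereas which.max() always returns the first maximum.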

I'd also recommend reading Julia Silge's posts on this subject, for example this and this.

nlp

r

text-mining

topic-modeling