Chronological Sentiment Analysis -- Cannot group by lines

Question

My corpus is a novel in plain text. I am using the tm and tidytext packages. Data processing went smoothly, and I created my DocumentTermMatrix without any difficulty.

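library(readr)  # read_lines()
library(tm)
text <- read_lines("GoneWithTheWind2.txt")
set.seed(314)
text <- iconv(text, "UTF-8", sub = "")
# The corpus construction was not shown in the post; one document per line is assumed
myCorpus <- VCorpus(VectorSource(text))
# mystopwords and Top200Words are my own character vectors (not shown)
myCorpus <- tm_map(myCorpus, removeWords, c(stopwords("english"),
                   stopwords("SMART"), mystopwords, Top200Words))
myDtm <- TermDocumentMatrix(myCorpus, control = list(minWordLength = 1))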

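However, I cannot get the code to run that uses inner_join between the bing lexicon and the DocumentTermMatrix in order to do a chronological sentiment analysis over the course of the novel. I wrote the function below based on online examples, but I do not know what to group by in count(sentiment) (I put ???? as a placeholder), because neither the plain text nor the DocumentTermMatrix has a "lines" column.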

library(dplyr)     # %>%, inner_join(), count(), mutate()
library(tidyr)     # spread()
library(tidytext)  # get_sentiments()

bing <- get_sentiments("bing")
m <- as.matrix(myDtm)
v <- sort(rowSums(m), decreasing = TRUE)
myNames <- names(v)
d <- data.frame(term = myNames, freq = v)
wind_polarity <- d %>%
  # Inner join to the lexicon
  inner_join(bing, by = c("term" = "word")) %>%
  # Count by sentiment, ????
  count(sentiment, ????) %>%
  # Spread sentiments
  spread(sentiment, n, fill = 0) %>%
  mutate(
    # Add polarity field
    polarity = positive - negative,
    # Add line number field
    line_number = row_number())
Then plot by ggplot.

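I tried adding an "Index" column to text that gives the line number of each document (line), but this column disappeared somewhere in the process. Any suggestions would be appreciated.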

Answer 1

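Here is one way to compute the polarity per line (based on a minimal example of three lines). You can join the dtm directly with the lexicon, which keeps the count information. Then convert the polarity information to a numeric representation and do the calculation per line. You can certainly rewrite the code and make it more elegant (I am not very familiar with the dplyr vocabulary, sorry). Anyway, I hope this helps.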

library(tm)
library(tidytext)
library(dplyr)  # %>%, inner_join(), mutate()

text <- c("I like coffee."
          ,"I rather like tea."
          ,"I hate coffee and tea, but I love orange juice.")

myDtm <- TermDocumentMatrix(VCorpus(VectorSource(text)),
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))

bing <- tidytext::get_sentiments("bing")  

wind_polarity <- as.matrix(myDtm) %>%
  data.frame(terms = rownames(myDtm), ., stringsAsFactors = FALSE) %>% 
  inner_join(bing, by= c("terms"="word")) %>%
  mutate(terms = NULL,
         polarity = ifelse( (.[,"sentiment"] == "positive"), 1,-1),
         sentiment = NULL) %>%
  { . * .$polarity } %>% 
  mutate(polarity = NULL) %>% 
  colSums

# The polarity per line, which you may plot, e.g., with base graphics or ggplot2
# X1 X2 X3 
# 1  1  0
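
For comparison, here is a minimal sketch of a tidy-text variant that keeps the line number as an explicit column, so there is something to group by in count(). This is not the code from the answer above but an alternative under stated assumptions: the text is tokenized with unnest_tokens instead of going through a term-document matrix, and chunk_size and index are hypothetical names introduced only for illustration.

```r
library(dplyr)
library(tidyr)
library(tidytext)

text <- c("I like coffee.",
          "I rather like tea.",
          "I hate coffee and tea, but I love orange juice.")

# chunk_size is a hypothetical parameter: the number of lines per group.
# It is 1 here so that every line is its own group; a novel would need more.
chunk_size <- 1

bing <- get_sentiments("bing")

line_polarity <- tibble(line_number = seq_along(text), text = text) %>%
  # One row per word; line_number is carried along with every token
  unnest_tokens(word, text) %>%
  # Keep only words that appear in the bing lexicon
  inner_join(bing, by = "word") %>%
  # Integer division turns line numbers into a chunk index
  mutate(index = (line_number - 1) %/% chunk_size + 1) %>%
  # Group by chunk and sentiment
  count(index, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(polarity = positive - negative)
```

With the three example lines, the polarity column agrees with the X1, X2, X3 values shown above. Two caveats: chunks that contain no lexicon words drop out of the result entirely, and the subtraction assumes that both positive and negative words occur somewhere in the text. The resulting data frame has one row per chunk, so index can serve as the x axis and polarity as the y axis in a ggplot2 plot.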
