How to get a sentiment score (and keep the sentiment words) in quanteda?

Consider this simple example:

library(tibble)
library(quanteda)

tibble(mytext = c('this is a good movie',
                  'oh man this is really bad',
                  'quanteda is great!'))

# A tibble: 3 x 1
  mytext                   
  <chr>                    
1 this is a good movie     
2 oh man this is really bad
3 quanteda is great!   

I would like to perform some basic sentiment analysis, but with a twist. Here is my dictionary, stored in a regular tibble:

mydictionary <- tibble(sentiment = c('positive', 'positive','negative'),
                       word = c('good', 'great', 'bad'))

# A tibble: 3 x 2
  sentiment word 
  <chr>     <chr>
1 positive  good 
2 positive  great
3 negative  bad  

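Essentially, I would like to count how many positive and negative words are detected in each sentence, while also keeping track of the matching words. In other words, the output should look like this: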

                          mytext nb.pos nb.neg   pos.words
1 this is a good and great movie      2      0 good, great
2      oh man this is really bad      0      1         bad
3             quanteda is great!      1      0       great

How can I do that in quanteda? Is this possible? Thanks!

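Stay tuned for quanteda v. 2.1, in which we will have greatly expanded, dedicated functions for sentiment analysis. In the meantime, see below. Note that I made some adjustments, since the text in your expected output differs from your input text, and your pos.words column contains all sentiment words, not just the positive ones. Below, I compute both the positive matches and all sentiment matches.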

# note the amended input text
mytext <- c(
  "this is a good and great movie",
  "oh man this is really bad",
  "quanteda is great!"
)

mydictionary <- tibble::tibble(
  sentiment = c("positive", "positive", "negative"),
  word = c("good", "great", "bad")
)

library("quanteda", warn.conflicts = FALSE)
## Package version: 2.0.9000
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.

# make the dictionary into a quanteda dictionary
qdict <- as.dictionary(mydictionary)
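
An equivalent way to build the dictionary, sketched here as an alternative to the data-frame conversion: group the words by sentiment with base R's `split()` and pass the resulting named list to `dictionary()`. The names `word_list` and `qdict2` are hypothetical and used only for illustration.

```r
library("quanteda")

mydictionary <- data.frame(
  sentiment = c("positive", "positive", "negative"),
  word = c("good", "great", "bad"),
  stringsAsFactors = FALSE
)

# split() groups the words by sentiment into a named list with
# one character vector per sentiment key
word_list <- split(mydictionary$word, mydictionary$sentiment)

# dictionary() accepts a named list of character vectors
qdict2 <- dictionary(word_list)
names(qdict2)
```

This should behave the same as `qdict` in the steps that follow, on the assumption that both objects end up with the keys `negative` and `positive`.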

Now we can use the lookup functions to get to your final data.frame.

# get the sentiment scores
toks <- tokens(mytext)
df <- toks %>%
  tokens_lookup(dictionary = qdict) %>%
  dfm() %>%
  convert(to = "data.frame")
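# rename the count columns; this assumes the dfm columns come out in the
# order negative, positive (check names(df) before renaming)
names(df)[2:3] <- c("nb.neg", "nb.pos")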

# get the matches for pos and all words
poswords <- tokens_keep(toks, qdict["positive"])
allwords <- tokens_keep(toks, qdict)

data.frame(
  mytext = mytext,
  df[, 2:3],
  pos.words = sapply(poswords, paste, collapse = ", "),
  all.words = sapply(allwords, paste, collapse = ", "),
  row.names = NULL
)
##                           mytext nb.neg nb.pos   pos.words   all.words
## 1 this is a good and great movie      0      2 good, great good, great
## 2      oh man this is really bad      1      0                     bad
## 3             quanteda is great!      0      1       great       great
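
If a separate column of negative matches is also wanted, the same `tokens_keep()` approach can be applied with the `negative` key. This is a minimal sketch, not part of the original answer; the names `negwords` and `neg.words` are hypothetical, and the dictionary is rebuilt inline with `dictionary()` so that the block runs on its own.

```r
library("quanteda")

mytext <- c(
  "this is a good and great movie",
  "oh man this is really bad",
  "quanteda is great!"
)

# same content as the dictionary above, built from a named list
qdict <- dictionary(list(positive = c("good", "great"), negative = "bad"))

toks <- tokens(mytext)

# keep only the tokens that match the negative key
negwords <- tokens_keep(toks, qdict["negative"])

# collapse each document's matches into one string; documents with
# no match become an empty string
neg.words <- unname(vapply(as.list(negwords), paste, character(1), collapse = ", "))
neg.words
```

The resulting vector has one element per document, so it can be passed to `data.frame()` as an extra column next to `pos.words` and `all.words`.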