Extract top positive and negative features when applying dictionary in quanteda

Question

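I have a data frame with around 100k rows of textual data. Using the quanteda package, I apply sentiment analysis (the Lexicoder dictionary) to eventually calculate a sentiment score. For an additional, more qualitative step of analysis, I would like to extract the top features (i.e. the terms from the negative/positive dictionary that occur most frequently in my data) to examine whether the discourse is driven by particular words.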

my_corpus <- corpus(my_df, docid_field = "ID", text_field = "my_text", metacorpus = NULL, compress = FALSE)
sentiment_corp <- dfm(my_corpus, dictionary = data_dictionary_LSD2015)

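However, going through the quanteda documentation, I could not figure out how to achieve this. Is there a way? I am aware of topfeatures and have read about it, but that did not help.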

Answer 1

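In all quanteda functions that take a pattern argument, the valid pattern types are character vectors, lists, and dictionaries. So the best way to assess the top features in each dictionary category (what we also call a dictionary key) is to select on that dictionary and then use topfeatures().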
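As a minimal sketch of that idea, the loop below selects on one key at a time and counts what remains. The texts and the two-key dictionary are made up for illustration; they are not part of quanteda or of the question's data.

```r
library("quanteda")

# Toy texts and a toy dictionary (both hypothetical, for illustration only)
txt <- c(d1 = "A good and fair budget, but a bad deficit.",
         d2 = "Bad debt, bad losses, good recovery.")
dict <- dictionary(list(positive = c("good", "fair", "recover*"),
                        negative = c("bad", "debt", "loss*", "deficit")))

toks <- tokens(txt)

# For each key: keep only the tokens matching that key, then count them
top_by_key <- sapply(names(dict), function(key) {
  matched <- tokens_select(toks, pattern = dict[key])
  topfeatures(dfm(matched), n = 5)
}, simplify = FALSE)

top_by_key
```

Subsetting with single brackets (dict[key]) keeps the result a dictionary, so the same loop should carry over to data_dictionary_LSD2015 and a real tokens object.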

Here is how to do this using the built-in data_corpus_irishbudget2010 object, as an example, with the Lexicoder Sentiment Dictionary.

library("quanteda")
## Package version: 1.4.3

# tokenize and select just the dictionary value matches
toks <- tokens(data_corpus_irishbudget2010) %>%
  tokens_select(pattern = data_dictionary_LSD2015)
lapply(toks[1:5], head)
## $`Lenihan, Brian (FF)`
## [1] "severe"        "distress"      "difficulties"  "recovery"     
## [5] "benefit"       "understanding"
## 
## $`Bruton, Richard (FG)`
## [1] "failed"   "warnings" "sucking"  "losses"   "debt"     "hurt"    
## 
## $`Burton, Joan (LAB)`
## [1] "remarkable" "consensus"  "Ireland"    "opposition" "knife"     
## [6] "dispute"   
## 
## $`Morgan, Arthur (SF)`
## [1] "worst"     "worst"     "well"      "corrupt"   "golden"    "protected"
## 
## $`Cowen, Brian (FF)`
## [1] "challenge"      "succeeding"     "challenge"      "oppose"        
## [5] "responsibility" "support"

To explore the top matches from the positive entries, we can select them further by subsetting the dictionary for the positive key.

# top positive matches
tokens_select(toks, pattern = data_dictionary_LSD2015["positive"]) %>%
  dfm() %>%
  topfeatures()
##    benefit    support   recovery       fair     create confidence 
##         68         52         44         41         39         37 
##    provide       well     credit       help 
##         36         33         31         29

And for the negative key:

# top negative matches
tokens_select(toks, pattern = data_dictionary_LSD2015["negative"]) %>%
  dfm() %>%
  topfeatures()
##    ireland    benefit        not    support     crisis   recovery 
##         79         68         52         52         47         44 
##       fair     create    deficit confidence 
##         41         39         38         37

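Why is "Ireland" a negative match? Because the LSD2015 includes ire* as a negative word, intended to match ire and ireful. With the default case-insensitive matching, it also matches Ireland (a frequently used term in this example corpus). This is an example of a "false positive" match, which is always a risk with dictionaries when using wildcards, or when working with a language such as English that has a very high rate of polysemes and homographs.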
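One way to check whether a wildcard is behind a suspicious match is to repeat the selection with case-sensitive matching and compare. The sketch below uses a made-up sentence and a one-entry dictionary, both purely illustrative, and relies on the case_insensitive argument of tokens_select().

```r
library("quanteda")

# Hypothetical toy text and dictionary, for illustration only
toks_toy <- tokens("Ireland faces the ire of voters.")
dict_toy <- dictionary(list(negative = "ire*"))

# Default matching ignores case, so the glob "ire*" also captures "Ireland"
loose <- tokens_select(toks_toy, pattern = dict_toy)
as.list(loose)[[1]]

# Case-sensitive matching keeps "ire" but drops the capitalised "Ireland"
strict <- tokens_select(toks_toy, pattern = dict_toy, case_insensitive = FALSE)
as.list(strict)[[1]]
```

Case-sensitive matching is only a diagnostic here: it would also miss genuine negative words that are capitalised at the start of a sentence, so removing the offending term before the lookup (for instance with tokens_remove()) may be the safer fix in practice.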
