How to merge sentiment analysis results (dfm) with the original readtext object in Quanteda?
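I have been using Quanteda's basic tokens_lookup function with the Young Soroka Sentiment Dictionary to count the number of positive and negative words in politicians' tweets.
Once I have the results, is there a way to add these columns back to the original readtext object, which has various docvars?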
> head(dat)
readtext object consisting of 6 documents and 11 docvars.
# Description: df[,13] [6 × 13]
doc_id text date username to replies retweets favorites geo mentions hashtags id permalink
* <chr> <chr> <chr> <chr> <chr> <int> <int> <int> <lgl> <chr> <chr> <dbl> <chr>
1 trump.c… "\"Sleepy… 2020-05-… realDonal… MZHemi… 5415 13062 39680 NA @AjitPaiF… "" 1.84e-224 https://twitter.com/rea…
2 trump.c… "\"He got… 2020-05-… realDonal… mikand… 20406 39081 111370 NA "" "" 1.84e-224 https://twitter.com/rea…
3 trump.c… "\"Thank … 2020-05-… realDonal… mikand… 5733 17293 66992 NA "" "" 1.84e-224 https://twitter.com/rea…
4 trump.c… "\".@CBS … 2020-05-… realDonal… "" 22215 25834 93625 NA @CBS @60M… "" 1.83e-224 https://twitter.com/rea…
5 trump.c… "\"This b… 2020-05-… realDonal… GreggJ… 5379 11403 39869 NA "" "" 1.81e-224 https://twitter.com/rea…
6 trump.c… "\"OBAMAG… 2020-05-… realDonal… "" 55960 89664 320171 NA "" "" 1.81e-224 https://twitter.com/rea…
> corp <- corpus(dat)
> toks <- tokens(corp, remove_punct = TRUE)
> toks_lsd <- tokens_lookup(toks, dictionary = data_dictionary_LSD2015[1:2])
> dfmat_lsd <- dfm(toks_lsd)
> head(dfmat_lsd)
Document-feature matrix of: 6 documents, 2 features (66.7% sparse).
6 x 2 sparse Matrix of class "dfm"
features
docs negative positive
trump.csv.1 2 0
trump.csv.2 0 0
trump.csv.3 0 1
trump.csv.4 2 1
trump.csv.5 0 0
trump.csv.6 0 0
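I have tried pulling the columns I need from the readtext object and building a new data.frame from them, which works fine, but ideally I would like to merge the dfm results back with the rest of the data.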
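All you need to do is convert the dfm to a data.frame and bind it to the original data. This relies on both objects having their rows in the same document order.
dat2 <- cbind(dat, convert(dfmat_lsd, to = "data.frame"))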
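Alternatively, to make sure each row is matched to the right document regardless of row order, you can join the two datasets on the document id:
library(tidyverse)
# The first column of the converted dfm holds the document id. It is named
# "document" in some quanteda releases and "doc_id" in others, so rename it by position.
data_sentiment <- convert(dfmat_lsd, to = "data.frame") %>% rename(doc_id = 1)
dat2 <- left_join(dat, data_sentiment, by = "doc_id")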
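To illustrate why joining on doc_id is safer than cbind(), here is a minimal base-R sketch using made-up data. The objects dat_toy and sent_toy are hypothetical stand-ins, not the tweets or the dfm from the question, and the sentiment rows are deliberately shuffled to mimic a mismatch in document order.

```r
# Toy stand-in for the readtext data (hypothetical values).
dat_toy <- data.frame(
  doc_id = c("a.csv.1", "a.csv.2", "a.csv.3"),
  retweets = c(10L, 20L, 30L),
  stringsAsFactors = FALSE
)

# Toy stand-in for the converted dfm, in a different row order.
sent_toy <- data.frame(
  doc_id = c("a.csv.3", "a.csv.1", "a.csv.2"),
  negative = c(0L, 2L, 1L),
  positive = c(1L, 0L, 0L),
  stringsAsFactors = FALSE
)

# Positional binding pairs row i with row i and ignores doc_id entirely.
by_position <- cbind(dat_toy, sent_toy[, c("negative", "positive")])

# Key-based merge pairs rows that share the same doc_id.
by_key <- merge(dat_toy, sent_toy, by = "doc_id", all.x = TRUE, sort = FALSE)

# merge() does not promise any row order here, so restore the original one.
by_key <- by_key[match(dat_toy$doc_id, by_key$doc_id), ]
```

With the shuffled input, by_position attaches counts to the wrong documents, while by_key keeps every count with its own doc_id. When the two objects are known to be in the same order, as they are straight after dfm(), both approaches give the same result.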