How is PcGw computed in quanteda's Naive Bayes?
Consider the usual example, reproduced from section 13.1 of An Introduction to Information Retrieval
(https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf):
library("quanteda")  # in quanteda >= 2, textmodel_nb() lives in the quanteda.textmodels package
txt <- c(d1 = "Chinese Beijing Chinese",
d2 = "Chinese Chinese Shanghai",
d3 = "Chinese Macao",
d4 = "Tokyo Japan Chinese",
d5 = "Chinese Chinese Chinese Tokyo Japan")
trainingset <- dfm(txt, tolower = FALSE)
trainingclass <- factor(c("Y", "Y", "Y", "N", NA), ordered = TRUE)
tmod1 <- textmodel_nb(trainingset, y = trainingclass, prior = "docfreq")
According to the documentation, PcGw is the posterior class probability given the word. How is it computed? I thought we cared about the other way around, i.e. P(word | class).
> tmod1$PcGw
features
classes Chinese Beijing Shanghai Macao Tokyo Japan
N 0.1473684 0.2058824 0.2058824 0.2058824 0.5090909 0.5090909
Y 0.8526316 0.7941176 0.7941176 0.7941176 0.4909091 0.4909091
Thanks!
The application is explained quite well in the book chapter you cite, but the essential difference is that PcGw is the "probability of the class given the word", while PwGc is the "probability of the word given the class". The former is the posterior, obtained from the latter via Bayes' rule, P(c|w) = P(w|c) P(c) / sum_c' P(w|c') P(c'), and it is what we need when combining the evidence from a set of words into a probability of class membership (in quanteda, this is what the predict() function does). The latter is just the likelihood, computed from the relative frequencies of the features within each class, smoothed by default by adding one to the counts in each class.
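For instance, here is a quick sketch of that prediction step, using the tmod1 fitted above (the exact return shape of predict() has varied a little across quanteda versions, so read the comments as the expected outcome rather than verbatim console output):

predict(tmod1)
# predicted class for each of d1..d5; these should come out Y, Y, Y, N, Y,
# and d5 -> Y agrees with the worked example in section 13.1 of the IR book

In recent versions, predict(tmod1, type = "probability") returns the per-document posterior class probabilities directly.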
If you want to verify this, you can do so as follows. First group the training documents by training class, and then smooth:
trainingset_bygroup <- dfm_group(trainingset[1:4, ], trainingclass[-5]) %>%
dfm_smooth(smoothing = 1)
trainingset_bygroup
# Document-feature matrix of: 2 documents, 6 features (0.0% sparse).
# 2 x 6 sparse Matrix of class "dfm"
# features
# docs Chinese Beijing Shanghai Macao Tokyo Japan
# N 2 1 1 1 2 2
# Y 6 2 2 2 1 1
Then you can see that the (smoothed) word likelihoods are the same as PwGc:
trainingset_bygroup / rowSums(trainingset_bygroup)
# Document-feature matrix of: 2 documents, 6 features (0.0% sparse).
# 2 x 6 sparse Matrix of class "dfm"
# features
# docs Chinese Beijing Shanghai Macao Tokyo Japan
# N 0.2222222 0.1111111 0.1111111 0.1111111 0.22222222 0.22222222
# Y 0.4285714 0.1428571 0.1428571 0.1428571 0.07142857 0.07142857
tmod1$PwGc
# features
# classes Chinese Beijing Shanghai Macao Tokyo Japan
# N 0.2222222 0.1111111 0.1111111 0.1111111 0.22222222 0.22222222
# Y 0.4285714 0.1428571 0.1428571 0.1428571 0.07142857 0.07142857
But you are probably more interested in P(class|word), since that is what Bayes' formula is all about: it combines the likelihood with the prior class probabilities P(c).
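As a check, here is a minimal sketch of that computation, reconstructing PcGw by hand from PwGc and the priors (with prior = "docfreq", three of the four labelled training documents are Y, so P(Y) = 3/4 and P(N) = 1/4):

Pc <- c(N = 1/4, Y = 3/4)                        # docfreq prior: 1 N doc, 3 Y docs
joint <- tmod1$PwGc * Pc[rownames(tmod1$PwGc)]   # P(w|c) * P(c), row by row
PcGw_manual <- t(t(joint) / colSums(joint))      # normalise each word (column) over the classes
PcGw_manual
#     Chinese   Beijing  Shanghai     Macao     Tokyo     Japan
# N 0.1473684 0.2058824 0.2058824 0.2058824 0.5090909 0.5090909
# Y 0.8526316 0.7941176 0.7941176 0.7941176 0.4909091 0.4909091

which reproduces tmod1$PcGw above.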