Lookahead and lookbehind not working in a quanteda dictionary

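I am trying to build a quanteda dictionary that contains many overlapping terms. I believe regex lookahead/lookbehind could be one way to handle this and avoid false hits, but I must be doing something wrong.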

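library("quanteda")

text <- c("guinea", "equatorial guinea", "guinea bissau")

# Plain pattern: should match all three documents
dict <- dictionary(list(guinea = "guinea"))
dfm1 <- dfm(text, dictionary = dict, valuetype = "regex")
colSums(dfm1)

# Negative lookbehind: skip "guinea" when preceded by "equatorial "
dict2 <- dictionary(list(guinea = "(?<!equatorial[[:space:]])guinea"))
dfm2 <- dfm(text, dictionary = dict2, valuetype = "regex")
colSums(dfm2)

# Negative lookahead: skip "guinea" when followed by " bissau"
dict3 <- dictionary(list(guinea = "guinea(?![[:space:]]bissau)"))
dfm3 <- dfm(text, dictionary = dict3, valuetype = "regex")
colSums(dfm3)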

The expected results are:

# dfm1
colSums(dfm1)
guinea 
     3 
# dfm2
colSums(dfm2)
guinea 
     2
# dfm3 
colSums(dfm3)
guinea 
     2 

But the actual results are all 3. Is this a problem with the lookahead/lookbehind, or with the way I am inserting the whitespace?

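This regex match does not work because a pattern cannot span multiple tokens. In a dfm(x, dictionary = ...) call, the text is tokenized first and tokens_lookup() is then called on the result. Each pattern is matched against one token at a time, so the lookbehind or lookahead never sees the neighbouring word, and "guinea" matches in all three documents.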
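To make the token-by-token point concrete, here is a minimal sketch in base R only. It is not quanteda's implementation: strsplit() on whitespace is an assumed stand-in for the tokenizer, and grepl(..., perl = TRUE) is an assumed stand-in for the per-token regex match. The names whole_hits and token_hits are just for illustration.

```r
text <- c("guinea", "equatorial guinea", "guinea bissau")
pattern <- "(?<!equatorial[[:space:]])guinea"

# Matching against the whole string: the lookbehind can see "equatorial ".
whole_hits <- grepl(pattern, text, perl = TRUE)

# Matching token by token (whitespace split as a stand-in for tokenization):
# each token is tested in isolation, so there is nothing to look behind at.
tokens_list <- strsplit(text, "[[:space:]]+")
token_hits <- vapply(
  tokens_list,
  function(tok) sum(grepl(pattern, tok, perl = TRUE)),
  integer(1)
)

print(whole_hits)  # TRUE FALSE TRUE
print(token_hits)  # 1 1 1
```

The per-token counts sum to 3, the same total the question reports, whereas the whole-string match excludes "equatorial guinea" as intended.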

There is a simpler way to do this: just include the multi-word values in the dictionary. So:

library("quanteda")
## Package version: 1.4.3

text <- c("guinea", "equatorial guinea", "guinea bissau")

dict <- dictionary(list(guinea = "guinea"))
dict2 <- dictionary(list(guinea = "equatorial guinea"))
dict3 <- dictionary(list(guinea = "guinea bissau"))

dfm(text, dictionary = dict)
## Document-feature matrix of: 3 documents, 1 feature (0.0% sparse).
## 3 x 1 sparse Matrix of class "dfm"
##        features
## docs    guinea
##   text1      1
##   text2      1
##   text3      1

dfm(text, dictionary = dict2)
## Document-feature matrix of: 3 documents, 1 feature (66.7% sparse).
## 3 x 1 sparse Matrix of class "dfm"
##        features
## docs    guinea
##   text1      0
##   text2      1
##   text3      0

dfm(text, dictionary = dict3)
## Document-feature matrix of: 3 documents, 1 feature (66.7% sparse).
## 3 x 1 sparse Matrix of class "dfm"
##        features
## docs    guinea
##   text1      0
##   text2      0
##   text3      1
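
If the goal is the counts from the question (every "guinea" except those in "equatorial guinea", or except those in "guinea bissau"), one possible approach is to put all three entries in a single dictionary and subtract. This is only a sketch: the key names equatorial_guinea and guinea_bissau are made up for illustration, and it assumes that matches for different keys are counted independently, which is my understanding of the default behaviour of tokens_lookup(). It calls tokens() and tokens_lookup() explicitly rather than passing the dictionary to dfm().

```r
library("quanteda")

text <- c("guinea", "equatorial guinea", "guinea bissau")

# One dictionary with a single-word key and two multi-word keys
# (key names are hypothetical, chosen for this example).
dict_all <- dictionary(list(
  guinea            = "guinea",
  equatorial_guinea = "equatorial guinea",
  guinea_bissau     = "guinea bissau"
))

# Tokenize first, then look up; multi-word values match token sequences.
toks <- tokens(text)
counts <- colSums(dfm(tokens_lookup(toks, dict_all)))

# "guinea" not preceded by "equatorial", and not followed by "bissau"
not_equatorial <- counts[["guinea"]] - counts[["equatorial_guinea"]]
not_bissau     <- counts[["guinea"]] - counts[["guinea_bissau"]]

print(counts)
print(c(not_equatorial = not_equatorial, not_bissau = not_bissau))
```

Under that assumption, both differences should come out as 2, which is what the question expected from dfm2 and dfm3.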