Quanteda dfm_lookup using dictionaries with multi-word patterns/expressions

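I am using a dictionary to identify the use of a specific set of words in a corpus. I have included multi-word patterns in the dictionary, but I don't think dfm_lookup (from the quanteda package) matches multi-word expressions. Does anyone know how to do the same thing as dfm_lookup with a dictionary that contains multi-word expressions?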

library(quanteda)

BritainEN <- 
  dictionary(list(identity=c("British", "Great Britain")))


British <- dfm_lookup(debate_dfm,
                      BritainEN, case_insensitive = TRUE)

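Yes: you need to use tokens_lookup() on the tokens before forming the dfm. Once the text has been reduced to counts of single-word features, the words no longer exist as the ordered sequences you need in order to match the multi-word values in your dictionary. So 1) form the tokens object, 2) apply the dictionary to the tokens with tokens_lookup(), and then 3) form the dfm.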

library("quanteda")
#> Package version: 1.5.2

BritainEN <- 
    dictionary(list(identity = c("British", "Great Britain")))

txt <- c(doc1 = "Great Britain is a country.",
         doc2 = "British citizens live in Great Britain.")

tokens(txt) %>%
    tokens_lookup(dictionary = BritainEN, exclusive = FALSE)
#> tokens from 2 documents.
#> doc1 :
#> [1] "IDENTITY" "is"       "a"        "country"  "."       
#> 
#> doc2 :
#> [1] "IDENTITY" "citizens" "live"     "in"       "IDENTITY" "."

tokens(txt) %>%
    tokens_lookup(dictionary = BritainEN) %>%
    dfm()
#> Document-feature matrix of: 2 documents, 1 feature (0.0% sparse).
#> 2 x 1 sparse Matrix of class "dfm"
#>       features
#> docs   identity
#>   doc1        1
#>   doc2        2
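
To see why the lookup has to happen on the tokens, here is a minimal side-by-side sketch that applies the same dictionary both ways. The object names dfm_from_dfm and dfm_from_tokens are made up for illustration, and I have not pasted output for the dfm_lookup() route, since the exact result may depend on your quanteda version. The expectation is only that it cannot find "Great Britain", because "great" and "britain" are separate features by that point.

```r
library("quanteda")

BritainEN <- dictionary(list(identity = c("British", "Great Britain")))

txt <- c(doc1 = "Great Britain is a country.",
         doc2 = "British citizens live in Great Britain.")

toks <- tokens(txt)

# Route 1: build the dfm first, then look up.
# Features are single tokens here, so the two-word value has nothing to match.
dfm_from_dfm <- dfm_lookup(dfm(toks), dictionary = BritainEN)

# Route 2: look up on the tokens (word order still intact), then build the dfm.
dfm_from_tokens <- dfm(tokens_lookup(toks, dictionary = BritainEN))

# Compare the counts for the "identity" key
as.matrix(dfm_from_dfm)
as.matrix(dfm_from_tokens)
```

The second matrix should reproduce the counts shown above (1 for doc1, 2 for doc2); the first should come out lower.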

Added:

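To answer the additional question in the comments, and to extend @phiver's very useful answer on this: there is also a nested_scope argument, designed for matches that may occur inside a multi-word expression (MWE) value belonging to another dictionary key.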

Example:

library("quanteda")
## Package version: 1.5.2

Ireland_nested <- dictionary(list(
  ie_alone = "Ireland",
  ie_nested = "Northern Ireland"
))

txt <- c(
  doc1 = "Northern Ireland is a country.",
  doc2 = "Some citizens of Ireland live in Northern Ireland."
)

toks <- tokens(txt)

tokens_lookup(toks, dictionary = Ireland_nested, exclusive = FALSE)
## Tokens consisting of 2 documents.
## doc1 :
## [1] "IE_NESTED" "IE_ALONE"  "is"        "a"         "country"   "."        
## 
## doc2 :
## [1] "Some"      "citizens"  "of"        "IE_ALONE"  "live"      "in"       
## [7] "IE_NESTED" "IE_ALONE"  "."
tokens_lookup(toks,
  dictionary = Ireland_nested, nested_scope = "dictionary",
  exclusive = FALSE
)
## Tokens consisting of 2 documents.
## doc1 :
## [1] "IE_NESTED" "is"        "a"         "country"   "."        
## 
## doc2 :
## [1] "Some"      "citizens"  "of"        "IE_ALONE"  "live"      "in"       
## [7] "IE_NESTED" "."

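The first call matches both keys, because by default nesting is only checked within a key, whereas here the nested pattern occurs in two different keys. (In @phiver's answer the patterns are nested within one key; in my example they are not.) With nested_scope = "dictionary", nested pattern matches are looked for across the entire dictionary, not just within keys, so the match is not doubled up in my example.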
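
If you care about the counts rather than the token sequences, the same comparison can be run through dfm(). This is only a sketch: it assumes a quanteda version that provides the nested_scope argument, it spells out the default value "key" for clarity, and the names dfm_key and dfm_dict are just illustrative.

```r
library("quanteda")

Ireland_nested <- dictionary(list(
  ie_alone = "Ireland",
  ie_nested = "Northern Ireland"
))

txt <- c(
  doc1 = "Northern Ireland is a country.",
  doc2 = "Some citizens of Ireland live in Northern Ireland."
)

toks <- tokens(txt)

# Nesting resolved only within each key (the default behaviour)
dfm_key <- dfm(tokens_lookup(toks, dictionary = Ireland_nested,
                             nested_scope = "key"))

# Nesting resolved across the whole dictionary
dfm_dict <- dfm(tokens_lookup(toks, dictionary = Ireland_nested,
                              nested_scope = "dictionary"))

as.matrix(dfm_key)
as.matrix(dfm_dict)
```

Going by the token output above, ie_alone should drop from 1 to 0 in doc1 and from 2 to 1 in doc2 once the scope is the whole dictionary, while ie_nested stays at 1 in both documents.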

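Which one you choose depends on your purpose. We designed quanteda to have the defaults that most users would want and expect, but added options like this one for users with specific needs. (Usually those needs are first expressed by Kohei or me while working through our own use cases!)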

To answer your question in the comments:

How does this work if the dictionary contains a word which then also appears in a multi-word expression in the dictionary?

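If the text contains "Northern Ireland" and the dictionary contains both "Northern Ireland" and "Ireland", it will be counted only once, but ONLY IF both values are in the same dictionary key, as in the Britain example in Ken's answer.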

See the examples below for the difference.

Example with a combined dictionary key:

library("quanteda")

Ireland_combined <- 
  dictionary(list(identity = c("Ireland", "Northern Ireland")))

txt <- c(doc1 = "Northern Ireland is a country.",
         doc2 = "Some citizens of Ireland live in Northern Ireland.")

tokens(txt) %>%
  tokens_lookup(dictionary = Ireland_combined , exclusive = FALSE)

# tokens from 2 documents.
# doc1 :
# [1] "IDENTITY" "is"       "a"        "country"  "."       
#
# doc2 :
# [1] "Some"     "citizens" "of"       "IDENTITY" "live"     "in"       "IDENTITY" "."


tokens(txt) %>%
  tokens_lookup(dictionary = Ireland_combined ) %>%
  dfm()

# Document-feature matrix of: 2 documents, 1 feature (0.0% sparse).
# 2 x 1 sparse Matrix of class "dfm"
#       features
# docs   identity
#   doc1        1
#   doc2        2

Example with separate dictionary entries:

Ireland_seperated <- 
  dictionary(list(identity1 = c("Ireland"),
                  identity2 = "Northern Ireland"))

tokens(txt) %>%
  tokens_lookup(dictionary = Ireland_seperated , exclusive = FALSE)

# tokens from 2 documents.
# doc1 :
# [1] "IDENTITY2" "IDENTITY1" "is"        "a"         "country"   "."        
# 
# doc2 :
# [1] "Some"      "citizens"  "of"        "IDENTITY1" "live"      "in"        "IDENTITY2" "IDENTITY1" "."

tokens(txt) %>%
  tokens_lookup(dictionary = Ireland_seperated ) %>%
  dfm()

# Document-feature matrix of: 2 documents, 2 features (0.0% sparse).
# 2 x 2 sparse Matrix of class "dfm"
#       features
# docs   identity1 identity2
#   doc1         1         1
#   doc2         2         1