Quanteda dfm_lookup using dictionaries with multi-word patterns/expressions
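I am using a dictionary to identify the use of a particular set of words in a corpus. I have included multi-word patterns in the dictionary, but I think dfm_lookup (from the quanteda package) does not match multi-word expressions. Does anyone know how to do the same thing as dfm_lookup with a dictionary that contains multi-word expressions?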
library(quanteda)
BritainEN <-
dictionary(list(identity = c("British", "Great Britain")))
# debate_dfm is a dfm built earlier from the corpus (not shown)
British <- dfm_lookup(debate_dfm,
BritainEN, case_insensitive = TRUE)
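Yes: you need to use tokens_lookup() on the tokens before forming the dfm. Once the individual words have been tabulated into a dfm, they no longer exist as the ordered sequences that you need in order to match the multi-word values in a dictionary. So 1) form the tokens object, 2) use tokens_lookup() to apply the dictionary to the tokens, and then 3) form the dfm.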
library("quanteda")
#> Package version: 1.5.2
BritainEN <-
dictionary(list(identity = c("British", "Great Britain")))
txt <- c(doc1 = "Great Britain is a country.",
doc2 = "British citizens live in Great Britain.")
tokens(txt) %>%
tokens_lookup(dictionary = BritainEN, exclusive = FALSE)
#> tokens from 2 documents.
#> doc1 :
#> [1] "IDENTITY" "is" "a" "country" "."
#>
#> doc2 :
#> [1] "IDENTITY" "citizens" "live" "in" "IDENTITY" "."
tokens(txt) %>%
tokens_lookup(dictionary = BritainEN) %>%
dfm()
#> Document-feature matrix of: 2 documents, 1 feature (0.0% sparse).
#> 2 x 1 sparse Matrix of class "dfm"
#> features
#> docs identity
#> doc1 1
#> doc2 2
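As a rough illustration of why the order of operations matters, the sketch below applies the same dictionary both ways to the example texts. The object names toks, late, and early are made up for this comparison, and it assumes a reasonably recent quanteda in which dfm() accepts a tokens object.

```r
library("quanteda")

BritainEN <- dictionary(list(identity = c("British", "Great Britain")))
txt <- c(doc1 = "Great Britain is a country.",
         doc2 = "British citizens live in Great Britain.")
toks <- tokens(txt)

# Lookup after the dfm is built: features are single words, so the
# two-word value "Great Britain" has nothing to match against
late <- as.matrix(dfm_lookup(dfm(toks), dictionary = BritainEN))

# Lookup on the tokens first: word order is still available, so
# "Great Britain" can be matched as a sequence
early <- as.matrix(dfm(tokens_lookup(toks, dictionary = BritainEN)))

late
early
```

Comparing the two matrices should show the late lookup missing the "Great Britain" occurrences that the early lookup picks up.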
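Added:
To answer the additional question from the comments, and to expand on the very useful answer to it from @phiver: there is also a nested_scope argument, designed for matches that may occur inside a multi-word expression (MWE) belonging to another dictionary key.
Example: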
library("quanteda")
## Package version: 1.5.2
Ireland_nested <- dictionary(list(
ie_alone = "Ireland",
ie_nested = "Northern Ireland"
))
txt <- c(
doc1 = "Northern Ireland is a country.",
doc2 = "Some citizens of Ireland live in Northern Ireland."
)
toks <- tokens(txt)
tokens_lookup(toks, dictionary = Ireland_nested, exclusive = FALSE)
## Tokens consisting of 2 documents.
## doc1 :
## [1] "IE_NESTED" "IE_ALONE" "is" "a" "country" "."
##
## doc2 :
## [1] "Some" "citizens" "of" "IE_ALONE" "live" "in"
## [7] "IE_NESTED" "IE_ALONE" "."
tokens_lookup(toks,
dictionary = Ireland_nested, nested_scope = "dictionary",
exclusive = FALSE
)
## Tokens consisting of 2 documents.
## doc1 :
## [1] "IE_NESTED" "is" "a" "country" "."
##
## doc2 :
## [1] "Some" "citizens" "of" "IE_ALONE" "live" "in"
## [7] "IE_NESTED" "."
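The first call matches both keys, because by default nesting is only resolved within a key, while here the nested pattern occurs across two different keys. (In @phiver's answer the patterns are nested within a single key; in my example they are not.) With nested_scope = "dictionary", it looks for nested pattern matches across the entire dictionary, not just within each key, so there is no double match in my example.
Which one you choose depends on your purpose. We designed quanteda to have the defaults that most users would want and expect, but added further options like this one for users with specific needs. (Usually those needs were first voiced by Kohei or me while working on our own specific use cases!)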
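Since the original question was about counts in a dfm, here is a minimal sketch of how the two settings carry through to the final matrix. It reuses the Ireland example; counts_key and counts_dict are names invented for this illustration, and it assumes a quanteda version that provides the nested_scope argument (2.0 or later, as far as I know).

```r
library("quanteda")

Ireland_nested <- dictionary(list(
  ie_alone = "Ireland",
  ie_nested = "Northern Ireland"
))
txt <- c(
  doc1 = "Northern Ireland is a country.",
  doc2 = "Some citizens of Ireland live in Northern Ireland."
)
toks <- tokens(txt)

# Default scope: the "Ireland" inside "Northern Ireland" is also
# counted under ie_alone, because the two patterns sit in different keys
counts_key <- as.matrix(dfm(tokens_lookup(toks, dictionary = Ireland_nested)))

# Dictionary-wide scope: the longer match wins across all keys
counts_dict <- as.matrix(dfm(tokens_lookup(toks,
  dictionary = Ireland_nested, nested_scope = "dictionary"
)))

counts_key
counts_dict
```

The difference shows up only in the ie_alone column, which is where the nested "Ireland" would otherwise be counted a second time.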
Answering your question from the comments:
How does this work if the dictionary contains a word which then also
appears in a multi-word expression in the dictionary
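If the text contains "Northern Ireland" and the dictionary contains both "Northern Ireland" and "Ireland", it will be counted only once, but ONLY IF both values are in the same dictionary key, as in the Britain example in Ken's answer.
See the examples below for the difference.
Example with a combined dictionary key: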
library("quanteda")
Ireland_combined <-
dictionary(list(identity = c("Ireland", "Northern Ireland")))
txt <- c(doc1 = "Northern Ireland is a country.",
doc2 = "Some citizens of Ireland live in Northern Ireland.")
tokens(txt) %>%
tokens_lookup(dictionary = Ireland_combined, exclusive = FALSE)
# tokens from 2 documents.
# doc1 :
# [1] "IDENTITY" "is" "a" "country" "."
#
# doc2 :
# [1] "Some" "citizens" "of" "IDENTITY" "live" "in" "IDENTITY" "."
tokens(txt) %>%
tokens_lookup(dictionary = Ireland_combined) %>%
dfm()
# Document-feature matrix of: 2 documents, 1 feature (0.0% sparse).
# 2 x 1 sparse Matrix of class "dfm"
# features
# docs identity
# doc1 1
# doc2 2
Example with separate dictionary keys:
Ireland_separated <-
dictionary(list(identity1 = c("Ireland"),
identity2 = "Northern Ireland"))
tokens(txt) %>%
tokens_lookup(dictionary = Ireland_separated, exclusive = FALSE)
# tokens from 2 documents.
# doc1 :
# [1] "IDENTITY2" "IDENTITY1" "is" "a" "country" "."
#
# doc2 :
# [1] "Some" "citizens" "of" "IDENTITY1" "live" "in" "IDENTITY2" "IDENTITY1" "."
tokens(txt) %>%
tokens_lookup(dictionary = Ireland_separated) %>%
dfm()
# Document-feature matrix of: 2 documents, 2 features (0.0% sparse).
# 2 x 2 sparse Matrix of class "dfm"
# features
# docs identity1 identity2
# doc1 1 1
# doc2 2 1