Using textstat_simil with a dictionary or globs in Quanteda
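I've checked the documentation, but as far as I can tell it is not possible to use the textstat_simil function directly with a dictionary or globs. What would be the best way to handle something like the following?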
txt <- "It is raining. It rains a lot during the rainy season"
rain_dfm <- dfm(txt)
textstat_simil(rain_dfm, "rain", method = "cosine", margin = "features")
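Is it necessary to use tokens_replace to change "rain*" to "rain", or is there another way? In this case stemming would solve the problem, but what about cases where stemming is not feasible?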
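This is possible, but first you need to use dfm_lookup() to convert the glob matches for "rain*" into "rain". (Note: there are other ways to do this, such as tokenizing and then using tokens_lookup() or tokens_replace(), but I think the lookup approach is more direct, and it is also what you asked about in the question.)
Also note that for feature similarity you must have more than one document, which is why I have added two more documents here.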
txt <- c("It is raining. It rains a lot during the rainy season",
         "Raining today, and it rained yesterday.",
         "When it's raining it must be rainy season.")
rain_dfm <- dfm(txt)
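Then use a dictionary to convert the glob matches (glob is the default pattern type) for "rain*" into "rain", while keeping the other features. (In this particular case, you are correct that dfm_wordstem() would accomplish the same thing.)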
rain_dfm <- dfm_lookup(rain_dfm,
                       dictionary(list(rain = "rain*")),
                       exclusive = FALSE,
                       capkeys = FALSE)
rain_dfm
## Document-feature matrix of: 3 documents, 17 features (52.9% sparse).
## 3 x 17 sparse Matrix of class "dfm"
##        features
## docs    it is rain . a lot during the season today , and yesterday when it's must be
##   text1  2  1    3 1 1   1      1   1      1     0 0   0         0    0    0    0  0
##   text2  1  0    2 1 0   0      0   0      0     1 1   1         1    0    0    0  0
##   text3  1  0    2 1 0   0      0   0      1     0 0   0         0    1    1    1  1
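For comparison, the tokens-level route mentioned in the note above could look roughly like the sketch below. It assumes the same three example texts and uses tokens_lookup() with the same dictionary. The name rain_dfm_alt is only an illustrative variable, not something from the original answer. As far as I know, recent quanteda releases expect dfm() to be given a tokens object rather than raw text, so this route may also be the more portable one, but check that against your installed version.

```r
library(quanteda)

txt <- c("It is raining. It rains a lot during the rainy season",
         "Raining today, and it rained yesterday.",
         "When it's raining it must be rainy season.")

# Tokenize first, then map every token matching the glob "rain*" to "rain".
toks <- tokens(txt)
toks <- tokens_lookup(toks,
                      dictionary(list(rain = "rain*")),
                      exclusive = FALSE,  # keep tokens that do not match the dictionary
                      capkeys = FALSE)    # keep the key as "rain" rather than "RAIN"

# Build the dfm from the replaced tokens (rain_dfm_alt is an illustrative name).
rain_dfm_alt <- dfm(toks)
rain_dfm_alt[, "rain"]
```

The feature order may differ from the dfm_lookup() result, but the "rain" column should hold the same counts, since the same glob is applied to the same texts.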
Now you can compute the cosine similarity for the target feature "rain":
textstat_simil(rain_dfm, selection = "rain", method = "cosine", margin = "features")
## rain
## it 0.9901475
## is 0.7276069
## rain 1.0000000
## . 0.9801961
## a 0.7276069
## lot 0.7276069
## during 0.7276069
## the 0.7276069
## season 0.8574929
## today 0.4850713
## , 0.4850713
## and 0.4850713
## yesterday 0.4850713
## when 0.4850713
## it's 0.4850713
## must 0.4850713
## be 0.4850713
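As a sanity check on these numbers, the cosine similarity can be recomputed by hand in base R from a few of the columns printed in the dfm above. This is only an illustrative sketch: the matrix m is typed in from the printed counts, and cosine is a hypothetical helper, not a quanteda function.

```r
# Counts copied from the dfm printed above (rows are text1, text2, text3).
m <- cbind(
  it     = c(2, 1, 1),
  is     = c(1, 0, 0),
  rain   = c(3, 2, 2),
  season = c(1, 0, 1),
  today  = c(0, 1, 0)
)

# Hypothetical helper: cosine similarity between two count vectors.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Similarity of each selected feature (column) to the "rain" column.
sims <- apply(m, 2, cosine, b = m[, "rain"])
round(sims, 7)
```

For example, "it" against "rain" is (2*3 + 1*2 + 1*2) / (sqrt(6) * sqrt(17)) = 10 / sqrt(102), which matches the 0.9901475 reported by textstat_simil().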