Is there an R function for finding keywords within a certain 'word distance'?
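What I need is a function that finds words within a certain 'word distance' of each other. For example, the words 'bag' and 'tool' are interesting when they appear together in a sentence such as "He had a bag of tools in his car."
With the quanteda kwic function I can find 'bag' and 'tool' individually, but this often gives me far too many results. I need, for example, 'bag' and 'tools' to be within five words of each other.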
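You can use the fcm() function to count co-occurrences within a fixed window, for instance 5 words. This creates a "feature co-occurrence matrix", which can be defined for a token span of any size or for the context of the whole document.
For your example, or at least an example based on my interpretation of your question, this would look like: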
library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
txt <- c(
d1 = "He had a bag of tools in his car",
d2 = "bag other other other other tools other"
)
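# Note: this call ran under quanteda 1.4.3; newer releases may expect a tokens
# object instead, e.g. fcm(tokens(txt), context = "window", window = 5)
fcm(txt, context = "window", window = 5)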
## Feature co-occurrence matrix of: 10 by 10 features.
## 10 x 10 sparse Matrix of class "fcm"
##         features
## features He had a bag of tools in his car other
##    He     0   1 1   1  1     1  0   0   0     0
##    had    0   0 1   1  1     1  1   0   0     0
##    a      0   0 0   1  1     1  1   1   0     0
##    bag    0   0 0   0  1     2  1   1   1     4
##    of     0   0 0   0  0     1  1   1   1     0
##    tools  0   0 0   0  0     0  1   1   1     5
##    in     0   0 0   0  0     0  0   1   1     0
##    his    0   0 0   0  0     0  0   0   1     0
##    car    0   0 0   0  0     0  0   0   0     0
##    other  0   0 0   0  0     0  0   0   0    10
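Here, the bag/tools cell is 2, so the two terms co-occur once in each document. In the first document they are two tokens apart. In the second document tools is exactly five tokens after bag, which still falls inside a window of 5. A pair more than 5 tokens apart would not be counted.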
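If what you want is to know which documents contain the pair, rather than an aggregate count, a per-document distance check is one option. The following is a minimal base R sketch, not a quanteda feature: `min_word_distance` is a hypothetical helper, the tokenizer is a simple lowercase split on non-alphanumeric characters, and the document `d3` is added here only to show a pair that is too far apart. It matches exact tokens, so 'tool' and 'tools' are treated as different words.

```r
# Minimum token distance between two keywords in one document (base R only).
# `min_word_distance` is a hypothetical helper, not part of quanteda.
min_word_distance <- function(text, word1, word2) {
  # Lowercase, then split on anything that is not a letter, digit, or apostrophe
  tokens <- strsplit(tolower(text), "[^a-z0-9']+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  pos1 <- which(tokens == word1)
  pos2 <- which(tokens == word2)
  # NA when either keyword is missing from the document
  if (length(pos1) == 0 || length(pos2) == 0) return(NA_integer_)
  # Smallest absolute gap over all pairs of positions
  min(abs(outer(pos1, pos2, "-")))
}

txt <- c(
  d1 = "He had a bag of tools in his car",
  d2 = "bag other other other other tools other",
  d3 = "bag one two three four five six tools"  # illustrative extra document
)

# One distance per document, keeping the document names
distances <- vapply(txt, min_word_distance, integer(1),
                    word1 = "bag", word2 = "tools")

# Names of the documents where the keywords are at most 5 tokens apart
within_five <- names(distances)[!is.na(distances) & distances <= 5]
print(distances)
print(within_five)
```

With these inputs, `d1` and `d2` pass the five-token threshold and `d3` does not. To match word variants, you could lowercase and stem the tokens before comparing, or pass a vector of accepted forms and test with `%in%`.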