kwic in quanteda (R) does not identify more than one word in regex pattern

Question

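I am trying to identify regex patterns in text, but kwic() does not match regex phrases that span more than one word. I tried using phrase(), but that did not work either.

To give an example: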

mycorpus = corpus(bla$`TEXT` )
foo = kwic(mycorpus, pattern = "\\bno\\b", window = 10, valuetype = "regex" ) #gives 1959 obs. 
foo = kwic(mycorpus, pattern = "\\bno\\b\\s{0,5}\\w+", window = 10, valuetype = "regex" ) #gives 0 obs.
foo = kwic(mycorpus, pattern = "no\\sother", window = 10, valuetype = "regex" ) #gives 0 obs. even though it should find 3 phrases

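This happens even though the text contains several instances of the pattern that should be identified.

Thank you for your help!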

Answer 1

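That is because kwic searches tokens, and tokens no longer contain whitespace. To search for a sequence of tokens, which quanteda treats as a "phrase", wrap the pattern in phrase(). (See also ?phrase.)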

library("quanteda")
## Package version: 2.0.0

txt <- "one two three four five"

# no match
kwic(txt, "one\\stwo", valuetype = "regex", window = 1)
## kwic object with 0 rows

# match
kwic(txt, phrase("one two"), valuetype = "regex", window = 1)
##                                 
##  [text1, 1:2]  | one two | three
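
The same idea can be sketched for the patterns in the question. phrase() splits its argument on whitespace, so each element can itself be a regular expression that is matched against one token. Because a token contains no whitespace, the anchors ^ and $ do the job that \b and \s were meant to do in the single-string patterns. The sample texts and object names below are made up for illustration; they are not the asker's data.

```r
library("quanteda")

# Hypothetical sample texts, invented for this illustration
txt <- c(d1 = "There is no other way.",
         d2 = "No one said no other thing.")
toks <- tokens(txt)

# The token "no" followed by the token "other":
# one regex per token, separated by a space inside phrase()
kw_exact <- kwic(toks, pattern = phrase("^no$ ^other$"),
                 valuetype = "regex", window = 2)

# The token "no" followed by any token containing word characters
kw_any <- kwic(toks, pattern = phrase("^no$ \\w+"),
               valuetype = "regex", window = 2)

print(kw_exact)
print(kw_any)
```

Matching is case-insensitive by default (the case_insensitive argument of kwic()), which is why "No" at the start of a sentence would also be picked up by the second pattern.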

regex

r

quanteda