kwic in quanteda (R) does not identify more than one word in regex pattern
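I am trying to identify regex patterns in text, but kwic() does not match regex phrases that are longer than one word. I tried using phrase(), but that did not work either.
To give an example: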
mycorpus = corpus(bla$`TEXT`)
foo = kwic(mycorpus, pattern = "\\bno\\b", window = 10, valuetype = "regex") # gives 1959 obs.
foo = kwic(mycorpus, pattern = "\\bno\\b\\s{0,5}\\w+", window = 10, valuetype = "regex") # gives 0 obs.
foo = kwic(mycorpus, pattern = "no\\sother", window = 10, valuetype = "regex") # gives 0 obs. even though it should find 3 phrases
The text does contain several matches for these patterns.
Thanks for your help!
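That's because kwic() searches tokens, and tokens no longer contain whitespace, so a pattern that requires whitespace can never match a single token. To search for a sequence of tokens, which quanteda calls a "phrase", wrap the pattern in phrase(). (See also ?phrase.)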
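To see why a whitespace pattern cannot match once the text is tokenized, here is a minimal base R sketch. It uses strsplit() as a rough stand-in for tokenization, which is an assumption for illustration and not quanteda's actual tokenizer; the sample sentence is made up.

```r
# Base R only. strsplit() is a simplified stand-in for tokenization,
# not quanteda's tokenizer; the sentence is a made-up example.
txt <- "there is no other way"
toks <- strsplit(txt, "\\s+")[[1]]

# The pattern requires whitespace, but no single token contains any
any(grepl("no\\sother", toks))   # FALSE

# The same pattern does match the untokenized string
grepl("no\\sother", txt)         # TRUE

# Matching one pattern per token, position by position, finds the pair
hits <- grepl("^no$", head(toks, -1)) & grepl("^other$", tail(toks, -1))
which(hits)                      # 3: the pair starts at the third token
```

The last step is roughly the idea behind a phrase: one pattern per token, matched against consecutive tokens.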
library("quanteda")
## Package version: 2.0.0
txt <- "one two three four five"
# no match
kwic(txt, "one\\stwo", valuetype = "regex", window = 1)
## kwic object with 0 rows
# match
kwic(txt, phrase("one two"), valuetype = "regex", window = 1)
##
## [text1, 1:2] | one two | three
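Applied to the patterns in the question, this might look like the sketch below. It assumes that phrase() splits its argument on whitespace and that each piece is then used as a regex against one token. The sample sentence is made up, and the ^ and $ anchors are added on the assumption that an unanchored regex could also match inside longer tokens such as "know". The text is tokenized explicitly with tokens() because newer quanteda versions expect kwic() to be given a tokens object.

```r
library("quanteda")

# Made-up sample text, not taken from the corpus in the question
txt <- "There is no other way and no one knows."
toks <- tokens(txt)

# "no" followed by "other": one anchored regex per token
kwic(toks, pattern = phrase("^no$ ^other$"), valuetype = "regex", window = 3)

# "no" followed by any single token
res <- kwic(toks, pattern = phrase("^no$ .+"), valuetype = "regex", window = 3)
res
```

Note that \\s and \\b are not needed here: the whitespace between tokens is already gone, and the anchors do the work of the word boundaries.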