Searching for advanced regex patterns with kwic()

Question

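I would like to use kwic() to find patterns in text with more advanced regex phrases, but I am struggling with the way kwic() tokenizes phrases, and two issues have come up:

1) How can I use grouping with phrases that contain whitespace: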

kwic(text, pattern = phrase("\\b(address|g[eo]t? into|gotten into)\\b \\bno\\b"), valuetype="regex")

Error in stri_detect_regex(types_search, pattern, case_insensitive = case_insensitive) : Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)

2) How can I look for longer sequences of words (similar to the first question):

kwic("this is a test", pattern= phrase("(\\w+\\s){1,3}"), valuetype="regex", remove_separator=FALSE)

kwic object with 0 rows

kwic("this is a test", pattern= phrase("(\\w+ ){0,2}"), valuetype="regex", remove_separator=FALSE)

Error in stri_detect_regex(types_search, pattern, case_insensitive = case_insensitive) : Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)

Thanks for any hints!

Answer 1

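The thing to understand about phrase() is that it takes a single character value and turns it into a sequence of patterns, split wherever there is whitespace. The whitespace separator itself should not, at least in normal usage, be part of any pattern.

I've chosen a reproducible example for the first part of your question that I think illustrates the issue and answers your question.

Here, we simply put the different patterns inside phrase(), with a space between them. This is equivalent to wrapping them in a list(), with the sequence of separate patterns as the elements of a character vector.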

library("quanteda")
#> Package version: 2.0.1

kwic("a b c a b d e", pattern = phrase("b c|d"), valuetype = "regex")
#>                                      
#>  [text1, 2:3]       a | b c | a b d e
#>  [text1, 5:6] a b c a | b d | e
kwic("a b c a b d e", pattern = list(c("b", "c|d")), valuetype = "regex")
#>                                      
#>  [text1, 2:3]       a | b c | a b d e
#>  [text1, 5:6] a b c a | b d | e

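We could also consider a vector of sequence matches, including very inclusive ones, such as the ".+ ^a$" below, which matches any token of one or more characters followed by the token "a". Note how the ^ and $ make explicit that this is the start and end of the (single-token) regex.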

kwic("a b c a b d e", pattern = phrase(c("b c|d", ".+ ^a$")), valuetype = "regex")
#>                                      
#>  [text1, 2:3]       a | b c | a b d e
#>  [text1, 3:4]     a b | c a | b d e  
#>  [text1, 5:6] a b c a | b d | e
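Applied to the first pattern in the question, this explains the error: the pattern is cut at each space, so the parenthesised group ends up spread over several incomplete pieces. The sketch below first mimics that split with base R strsplit() (an approximation of what phrase() does, not its actual implementation) and then rewrites the search as one phrase per alternative, with a complete single-token regex at each position. The sample text is made up for illustration, and the text is tokenized first on the assumption that kwic() is given a tokens object.

```r
library("quanteda")

# The pattern from the question, with backslashes escaped for R.
pat <- "\\b(address|g[eo]t? into|gotten into)\\b \\bno\\b"

# Approximation of how the phrase is cut into per-token patterns:
# the opening "(" and the closing ")" land in different pieces.
pieces <- strsplit(pat, " ", fixed = TRUE)[[1]]
print(pieces)

# Hypothetical sample text, tokenized before searching.
toks <- tokens("They got into no trouble and address no concerns.")

# One phrase per alternative; each space-separated element is a
# complete regex for exactly one token.
pats <- phrase(c("^address$ ^no$",
                 "^g[eo]t?$ ^into$ ^no$",
                 "^gotten$ ^into$ ^no$"))
kw <- kwic(toks, pattern = pats, valuetype = "regex")
print(kw)
```

The ^ and $ anchors take over the role of \b here, because each regex is matched against a whole token rather than against running text.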

For the second part, you can match anything using wildcard matching, which is easiest with the default "glob" matching:

kwic("this is a test", pattern = phrase("* * *"))
#>                                      
#>  [text1, 1:3]      | this is a | test
#>  [text1, 2:4] this | is a test |

kwic("this is a test", pattern = phrase("* *"))
#>                                         
#>  [text1, 1:2]         | this is | a test
#>  [text1, 2:3]    this |  is a   | test  
#>  [text1, 3:4] this is | a test  |
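To cover the "one to three words" case from the question, one option is to supply several glob phrases of different lengths in a single call. This is only a sketch, assuming that overlapping matches of every length are wanted; the object names are illustrative.

```r
library("quanteda")

# Tokenize first so that kwic() receives a tokens object.
toks <- tokens("this is a test")

# One glob phrase per sequence length: 1, 2 and 3 tokens.
kw_seq <- kwic(toks, pattern = phrase(c("*", "* *", "* * *")))
print(kw_seq)
```

Each element of the character vector is treated as its own phrase, so the result combines the matches of all three lengths.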

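Finally, note that it is possible to include whitespace as part of a pattern match, but only if you have tokens that contain whitespace. That would be the case if you passed remove_separators = FALSE through ... to the tokens() call (see ?kwic), or if you created the tokens in some other way that ensures they contain whitespace.

as.tokens(list(d1 = c("a b", " ", "c"))) %>%
  kwic(phrase("\\s"), valuetype = "regex")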
#>                        
#>  [d1, 1]     | a b |  c
#>  [d1, 2] a b |     | c

The "a b" shown there is actually the single token "a b", not the sequence of tokens "a", "b". The blank in the second row is the " " token.
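For the other route mentioned above, keeping the separators when tokenizing, a minimal sketch might look like the following. It assumes that tokens() with remove_separators = FALSE keeps each space as its own token; the patterns and object names are illustrative.

```r
library("quanteda")

# Keep the separators, so the spaces become tokens of their own.
toks_ws <- tokens("this is a test", remove_separators = FALSE)

# Match the whitespace tokens themselves.
kw_space <- kwic(toks_ws, pattern = "^\\s$", valuetype = "regex")
print(kw_space)

# A word token followed by a whitespace token: two single-token
# regexes, separated by the space that phrase() splits on.
kw_word_space <- kwic(toks_ws, pattern = phrase("^\\w+$ ^\\s$"),
                      valuetype = "regex")
print(kw_word_space)
```

Even here the space inside phrase() only separates the two patterns; the whitespace token is matched by the second regex, \s.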

Created on 2020-03-31 by the reprex package (v0.3.0)
