Tokenizing a string with a regular expression

Question

Suppose I have a string like this: abc def ghi jkl (for simplicity I put a space at the end, but that is not important to me), and I want to capture its "chunks" as follows:

abc

def

ghi

jkl

if and only if the string contains 1 to 4 "chunks". I have already tried the following regex:

^([^ ]+ ){1,4}$

on Regex101.com, but it captures only the last occurrence. The site shows this warning about it:

A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data

How can I correct the regex to achieve my goal?
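The behavior described in the warning can be reproduced outside Regex101. The following is a minimal sketch using Python's standard re module (an assumption; the question itself does not name a language). It shows that the pattern matches the whole string while the repeated group retains only its final iteration.

```python
import re

text = "abc def ghi jkl "

# The group ([^ ]+ ) repeats up to four times, but a capturing
# group stores only the text from its last iteration.
m = re.match(r"^([^ ]+ ){1,4}$", text)

print(m.group(0))  # the whole match: 'abc def ghi jkl '
print(m.group(1))  # only the last iteration: 'jkl '
```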

Answer 1

This can be done on Linux with tr:

tr -sc 'a-zA-Z' '\n' < text.txt > out_text.txt

where text.txt is the file containing the string you want to normalize.
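For readers without tr at hand, roughly the same transformation can be sketched in Python. This is an assumed equivalent, not part of the original answer: the sample string stands in for the contents of text.txt, and, like the tr command, it only splits on non-letters and does not check the 1 to 4 chunk limit.

```python
import re

# Stand-in for the contents of text.txt (assumed sample input)
text = "abc def ghi jkl \n"

# Replace each run of non-letter characters with a single newline,
# mirroring tr -sc 'a-zA-Z' '\n' (complement, translate, squeeze)
out_text = re.sub(r"[^a-zA-Z]+", "\n", text)

print(out_text, end="")
```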

Answer 2

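Since you do not have access to the code, the only solution you can use is a regex based on the \G operator, which allows only consecutive matches, combined with a lookahead anchored at the start that requires 1 to 4 non-whitespace chunks in the string.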

(?:^(?=\s*\S+(?:\s+\S+){0,3}\s*$)|\G(?!^))\s*\K\S+

See the regex demo.

Details:

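(?:^(?=\s*\S+(?:\s+\S+){0,3}\s*$)|\G(?!^)) - a custom boundary that checks for either:
- ^(?=\s*\S+(?:\s+\S+){0,3}\s*$) - the start of the string (^) followed by 1 to 4 non-whitespace chunks separated by 1+ whitespace characters, with leading/trailing whitespace allowed, too
- | - or
- \G(?!^) - the current position at the end of the previous successful match (\G also matches the start of the string, so a negative lookahead is needed to exclude that position, because a separate check is performed there)
\s* - zero or more whitespace characters
\K - the match reset operator, which discards all text matched so far
\S+ - 1 or more non-whitespace characters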
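When code access is available, the same two-step logic (validate the chunk count, then collect the chunks) can be written without \G and \K, which Python's standard re module does not support. The sketch below is an assumed alternative, not the PCRE pattern above; the function name chunks_1_to_4 is hypothetical.

```python
import re

# Same validation as the lookahead in the pattern above:
# 1 to 4 non-whitespace chunks, optional surrounding whitespace.
VALID = re.compile(r"\s*\S+(?:\s+\S+){0,3}\s*")

def chunks_1_to_4(text):
    """Return the chunks if there are 1 to 4 of them, else an empty list."""
    if VALID.fullmatch(text) is None:
        return []
    # Validation passed, so every run of non-whitespace is a chunk.
    return re.findall(r"\S+", text)

print(chunks_1_to_4("abc def ghi jkl "))  # ['abc', 'def', 'ghi', 'jkl']
print(chunks_1_to_4("a b c d e"))         # []
```

Splitting the work into a validation step and a findall keeps each pattern simple, at the cost of scanning the string twice.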

regex

pcre

tokenize