RegEx for capturing groups between repeated words

The keywords are "*OR" and "*AND".

Suppose I have the following string:

This is a t3xt with special characters like !#. *AND and this is another text with special characters *AND this repeats *OR do not repeat *OR have more strings *AND finish with this string.

I want the following:

group1 "This is a t3xt with special characters like !#."  
group2 "*AND"  
group3 "and this is another text with special characters"  
group4 "*AND"  
group5 "this repeats"  
group6 "*OR"  
group7 "do not repeat"  
group8 "*OR"  
group9 "have more strings"  
group10 "*AND"  
group11 "finish with this string."  

I tried this:

(.+?)(\*AND|\*OR)

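But it only captures the first string, and I would then have to keep repeating the pattern to collect the others. The problem is that some strings contain only one *AND, or only one *OR, or dozens of them; it is quite random. The following regex does not work either: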

((.+?)(\*AND|\*OR))+

For example, a string with only one keyword:

This is a t3xt with special characters like !#. *AND and this is another text with special characters

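PHP has a preg_split function for this kind of thing. preg_split lets you split a string on a delimiter that is defined as a regex pattern. It also accepts a flag, PREG_SPLIT_DELIM_CAPTURE, that includes the matched delimiters in the result, provided the delimiter is wrapped in a capturing group.

So instead of writing a regex that matches the whole text, you write a regex for the delimiter only.

Example: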

$string = "This is a t3xt with special characters like !#. *AND and this is another text with special characters *AND this repeats *OR do not repeat *OR have more strings *AND finish with this string.";
// The capturing group plus PREG_SPLIT_DELIM_CAPTURE keeps each delimiter in the result.
$parts = preg_split('~(\*(?:AND|OR))~', $string, 0, PREG_SPLIT_DELIM_CAPTURE);
print_r($parts);

Output:

Array
(
    [0] => This is a t3xt with special characters like !#. 
    [1] => *AND
    [2] =>  and this is another text with special characters 
    [3] => *AND
    [4] =>  this repeats 
    [5] => *OR
    [6] =>  do not repeat 
    [7] => *OR
    [8] =>  have more strings 
    [9] => *AND
    [10] =>  finish with this string.
)
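Note that the text pieces above keep the spaces around each delimiter, while the groups requested in the question do not. One possible way to drop them, as a minimal sketch, is to let the delimiter pattern absorb the surrounding whitespace with `\s*` outside the capturing group. The helper name `splitOnKeywords` is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper: splits on *AND / *OR and keeps the delimiters.
function splitOnKeywords(string $text): array
{
    // \s* on both sides sits outside the capturing group, so the whitespace
    // is consumed by the split but not returned with the delimiter.
    return preg_split('~\s*(\*(?:AND|OR))\s*~', $text, 0, PREG_SPLIT_DELIM_CAPTURE);
}

$long  = "This is a t3xt with special characters like !#. *AND and this is another text with special characters *AND this repeats *OR do not repeat *OR have more strings *AND finish with this string.";
$short = "This is a t3xt with special characters like !#. *AND and this is another text with special characters";

print_r(splitOnKeywords($long));   // 11 elements, text and delimiters alternating
print_r(splitOnKeywords($short));  // 3 elements: text, *AND, text
```

The same call handles one keyword or dozens, since the pattern only describes the delimiter and preg_split repeats it as often as needed.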

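But if you really want to stick with preg_match (which is what you tagged in your question), you will need preg_match_all. It works like preg_match, except that it performs global/repeated matching.

Example: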

$string = "This is a t3xt with special characters like !#. *AND and this is another text with special characters *AND this repeats *OR do not repeat *OR have more strings *AND finish with this string.";
preg_match_all('~(?:(?:(?!\*(?:AND|OR)).)+)|(?:\*(?:AND|OR))~',$string,$matches);
print_r($matches);

Output:

Array
(
    [0] => Array
        (
            [0] => This is a t3xt with special characters like !#. 
            [1] => *AND
            [2] =>  and this is another text with special characters 
            [3] => *AND
            [4] =>  this repeats 
            [5] => *OR
            [6] =>  do not repeat 
            [7] => *OR
            [8] =>  have more strings 
            [9] => *AND
            [10] =>  finish with this string.
        )

)
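The pattern is dense, so here is the same expression spelled out with the `x` (extended) modifier, which ignores whitespace in the pattern and allows `#` comments. This is only an annotated restatement of the pattern above; the variable names `$pattern` and `$parts` are chosen for illustration.

```php
<?php
$string = "This is a t3xt with special characters like !#. *AND and this is another text with special characters *AND this repeats *OR do not repeat *OR have more strings *AND finish with this string.";

$pattern = '~
    (?:                      # a run of plain text:
        (?! \*(?:AND|OR) )   #   only if no delimiter starts at this position
        .                    #   consume one character
    )+                       #   and repeat, stopping right before a delimiter
  |                          # or
    \*(?:AND|OR)             # the delimiter itself
~x';

preg_match_all($pattern, $string, $matches);

// There are no capturing groups, so everything of interest is in $matches[0].
$parts = $matches[0];
print_r($parts);
```

Each match is either a run of text or a single delimiter, which is why both kinds of pieces end up interleaved in one list.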

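First, note that unlike preg_split, preg_match_all (and preg_match) gives you a multi-dimensional array of matches rather than a single-dimensional one. Second, the pattern I used could technically be simplified a bit, but at the cost of having to reference several arrays inside the returned multi-dimensional array (one array for the matched text and another for the matched delimiters). You would then have to loop over them and alternate between the two; in other words, extra cleanup is needed to get a single final array containing both sets of matches, like the one above.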
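To illustrate that extra cleanup, here is a rough sketch of what such a simplified pattern might look like. The pattern and the names `$flat` and `$text` are my assumptions for illustration, not something taken from the question. Group 1 holds a text piece and group 2 holds the delimiter that follows it, or an empty string at the end of the input.

```php
<?php
$string = "This is a t3xt with special characters like !#. *AND and this is another text with special characters *AND this repeats *OR do not repeat *OR have more strings *AND finish with this string.";

// Group 1: text (lazy). Group 2: the delimiter after it, or end of string.
preg_match_all('~(.+?)(\*(?:AND|OR)|$)~s', $string, $matches);

// $matches[1] holds the texts and $matches[2] the delimiters,
// so the two arrays have to be interleaved by hand.
$flat = [];
foreach ($matches[1] as $i => $text) {
    $flat[] = $text;
    if ($matches[2][$i] !== '') {   // the last piece has no delimiter after it
        $flat[] = $matches[2][$i];
    }
}
print_r($flat);
```

Because `(.+?)` needs at least one character, this sketch would not split cleanly if the input started with a delimiter or had two delimiters back to back, which is one more reason the manual approach is harder to get right.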

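I am only showing the preg_match_all approach because you technically asked for it in your question. I recommend preg_split instead: it removes most of this overhead, and handling scenarios like this one is why it exists in the first place.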