preg_split based on a Sentence

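I have the following script to split text into sentences. Besides punctuation, there are some phrases that I want to treat as the end of a sentence. This works fine when the terminator is a single character, but not when the phrase contains a space.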

Here is my working code:

$re = '/# Split sentences on whitespace between them.
(?<=                # Begin positive lookbehind.
  [.!?:\#*]             # Either an end of sentence punct,
| [.!?:][\'"]           # or end of sentence punct and quote,
| [\r\t\n]              # or a line break or tab,
| HYPERLINK
| \.org
| \.gov
| \.aspx
| \.com
| Date
| Dear  
)                   # End positive lookbehind.
(?<!                # Begin negative lookbehind.
  Mr\.              # Skip either "Mr."
| Mrs\.             # or "Mrs.",    
| Ms\.              # or "Ms.",
| Jr\.              # or "Jr.",
| Dr\.              # or "Dr.",
| Prof\.            # or "Prof.",
| U\.S\.A\.
| U\.S\.
| Sr\.              # or "Sr.",
| T\.V\.A\.         # or "T.V.A.",
| a\.m\.            # or "a.m.",
| p\.m\.            # or "p.m.",
| •\.
| :\.

                    # or... (you get the idea).
)                   # End negative lookbehind.
\s+                 # Split on whitespace between sentences.

/ix';

Here is an example phrase I am trying to add: "Total Gross Income"

I have tried formatting it in the following ways, but none of them work:

$re = '/# Split sentences on whitespace between them.
(?<=                # Begin positive lookbehind.
  [.!?:\#*]             # Either an end of sentence punct,
| [.!?:][\'"]           # or end of sentence punct and quote,
| [\r\t\n]              # or a line break or tab,
| HYPERLINK
| \.org
| \.gov
| \.aspx
| \.com
| Date
| Dear  
| "Total Gross Income"
| Total[ X]Gross[ X]Income
| Total" "Gross" "Income
)  

For example, if I have the following code:

$block_o_text = "You could receive the wrong amount. If you receive more benefits than you    should, you must pay them back. When will we review your case? An eligibility review form will be sent before your benefits stop. Total Gross Income Total ResourcesMedical ProgramsHousehold.";

$sentences = preg_split($re, $block_o_text, -1, PREG_SPLIT_NO_EMPTY);

for ($i = 0; $i < count($sentences); ++$i) {
    echo $i . " - " . $sentences[$i] . "<BR>";
}

The result I get is:

77 - You could receive the wrong amount.
78 - If you receive more benefits than you should, you must pay them back.
79 - When will we review your case?
80 - An eligibility review form will be sent before your benefits stop.
81 - 01/201502/2015
82 - Total Gross Income Total ResourcesMedical ProgramsHousehold 

What I want to get is:

77 - You could receive the wrong amount.
78 - If you receive more benefits than you should, you must pay them back.
79 - When will we review your case?
80 - An eligibility review form will be sent before your benefits stop.
81 - 01/201502/2015
82 - Total Gross Income
83 - Total ResourcesMedical ProgramsHousehold 

What am I doing wrong?

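Your problem is the whitespace requirement (the \s+) that follows your lookbehinds. It needs at least one whitespace character in order to split, but if you remove it, you end up capturing the preceding letter and breaking the whole thing.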
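To make the whitespace requirement concrete, here is a minimal sketch. The reduced patterns and the two sample strings are made up for illustration and are not taken from the question. It also shows a second pitfall that likely affects the attempts in the question: under the x modifier, literal spaces in the pattern are ignored, so a phrase written with plain spaces never matches text that contains spaces.

```php
<?php
// Reduced pattern: lookbehind for the phrase, then REQUIRED whitespace.
$re = '/(?<=Total\sGross\sIncome)\s+/i';

// Same idea, but with literal spaces and the x modifier. Under x the
// spaces are ignored, so this really looks for "TotalGrossIncome".
$re_x = '/(?<=Total Gross Income)\s+/ix';

$with_space    = 'Total Gross Income Total Resources';
$without_space = 'Total Gross IncomeTotal Resources';

// Whitespace follows the phrase, so the split happens.
$a = preg_split($re, $with_space, -1, PREG_SPLIT_NO_EMPTY);

// No whitespace after the phrase, so there is nothing for \s+ to match.
$b = preg_split($re, $without_space, -1, PREG_SPLIT_NO_EMPTY);

// Literal spaces under x: the lookbehind never matches, so no split.
$c = preg_split($re_x, $with_space, -1, PREG_SPLIT_NO_EMPTY);

print_r($a); // two pieces
print_r($b); // one piece (unsplit)
print_r($c); // one piece (unsplit)
```

Writing the spaces as \s (or as [ ] inside a character class) keeps them significant under the x modifier.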

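So, as far as I know, you can't do this entirely with lookarounds. You can still keep lookarounds for some of the expressions (whitespace preceded by punctuation, and so on), but not for the specific phrases.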

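Instead, you can match the phrase itself as the delimiter and use the PREG_SPLIT_DELIM_CAPTURE flag to capture what you are splitting on. Something like this should get you started: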

$re = '/((?<=[\.\?\!])\s+|Total\sGross\sIncome)/ix';

$block_o_text = "You could receive the wrong amount. If you receive more benefits than you    should, you must pay them back. When will we review your case? An eligibility review form will be sent before your benefits stop. Total Gross IncomeTotal ResourcesMedical ProgramsHousehold.";

$sentences = preg_split($re, $block_o_text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

for ($i = 0; $i < count($sentences); ++$i) {
    if (!ctype_space($sentences[$i])) {
        echo $i . " - " . $sentences[$i] . "<br>";
    }
}

Output:

0 - You could receive the wrong amount.
2 - If you receive more benefits than you should, you must pay them back.
4 - When will we review your case?
6 - An eligibility review form will be sent before your benefits stop.
8 - Total Gross Income
9 - Total ResourcesMedical ProgramsHousehold.
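
The gaps in the numbering above come from the captured whitespace delimiters, which the loop skips with ctype_space. If a clean, consecutively indexed list is preferable, the same approach could be wrapped in a small helper along these lines. This is only a sketch: the function name split_sentences and the idea of passing the phrases as an array are assumptions for illustration and are not part of the code above.

```php
<?php
/**
 * Hypothetical helper: split on whitespace that follows . ? or !,
 * and additionally treat each given phrase as its own "sentence".
 */
function split_sentences(string $text, array $phrases): array
{
    // First alternative: whitespace preceded by end-of-sentence punctuation.
    $alternatives = ['(?<=[.?!])\s+'];

    foreach ($phrases as $phrase) {
        // Escape regex metacharacters, then make each space match any
        // single whitespace character, as in Total\sGross\sIncome.
        $alternatives[] = str_replace(' ', '\s', preg_quote($phrase, '/'));
    }

    $re = '/(' . implode('|', $alternatives) . ')/i';

    // Keep the delimiters so that the phrases survive the split.
    $parts = preg_split($re, $text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    // Drop the whitespace-only delimiters and reindex from 0.
    return array_values(array_filter($parts, function (string $part): bool {
        return trim($part) !== '';
    }));
}

$text   = 'Benefits stop. Total Gross IncomeTotal ResourcesMedical ProgramsHousehold.';
$result = split_sentences($text, ['Total Gross Income']);

foreach ($result as $i => $sentence) {
    echo $i . ' - ' . $sentence . "<br>";
}
```

Note that a phrase is matched case-insensitively wherever it occurs, including in the middle of an ordinary sentence, so this only suits phrases that always act as standalone labels in the text.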