Splitting string in key value with regex

I'm having some trouble parsing the plain-text output from samtools stats.

Example output:

45205768 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
5203838 + 0 duplicates
44647359 + 0 mapped (98.76% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

I want to parse the file line by line and end up with a PHP array like this:

Array(
 "in total" => [45205768,0],
 ...
)

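Long story short, I want the numbers at the start of each line as an array of integers, and the string that follows them (without the parenthetical part) as the key.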

I think this is what you're after:

^(\d+)(\s\+\s)(\d+)(.+)

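See it work here on Regex101. Take the first and third groups for the two numbers; the fourth group holds the rest of the line, including its leading space and any parenthetical.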
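To make the group numbering concrete, here is a minimal sketch of how that pattern might be applied to a single line. The helper name `parseLine()` and the extra `preg_replace` step that strips the parenthetical are assumptions added for illustration; they are not part of the pattern above.

```php
<?php
// Sketch only: parseLine() is a hypothetical helper, not from the original answer.
function parseLine(string $line): ?array
{
    if (!preg_match('/^(\d+)(\s\+\s)(\d+)(.+)/', $line, $m)) {
        return null; // the line does not look like "N + M label"
    }
    // Group 4 still carries the leading space and any "(...)" suffix, so strip both.
    $key = trim(preg_replace('/\s*\(.*$/', '', $m[4]));
    // Groups 1 and 3 are the two counts; group 2 is only the " + " separator.
    return [$key, [(int)$m[1], (int)$m[3]]];
}

var_export(parseLine('44647359 + 0 mapped (98.76% : N/A)'));
```

Since group 2 only captures the separator, it could just as well be left uncaptured, which is what the next pattern does.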

^(\d+)\s\+\s(\d+)\s([a-zA-Z0-9 ]+).*$

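This regex puts the first value, the second value, and the string that follows (without the parenthetical) in capture groups 1, 2, and 3, respectively. Group 3 keeps a trailing space when a parenthetical follows, so trim it before using it as a key.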

Regex101 demo
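As a rough sketch of how this pattern could be used line by line in PHP: the `$lines` array below is illustrative sample input taken from the question (in practice it might come from `file()` on the stats file), and the `trim()` call is an addition to drop the space that group 3 keeps before a parenthetical.

```php
<?php
// Sketch only: $lines is illustrative sample input, not read from a real file.
$lines = [
    '45205768 + 0 in total (QC-passed reads + QC-failed reads)',
    '0 + 0 secondary',
    '44647359 + 0 mapped (98.76% : N/A)',
];

$stats = [];
foreach ($lines as $line) {
    if (preg_match('/^(\d+)\s\+\s(\d+)\s([a-zA-Z0-9 ]+).*$/', $line, $m)) {
        // Group 3 stops at the first character outside [a-zA-Z0-9 ], e.g. "(".
        $stats[trim($m[3])] = [(int)$m[1], (int)$m[2]];
    }
}

var_export($stats);
```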

This can be solved with just two capture groups plus the full-string match.

My pattern extracts exactly the desired substrings and leaves the trailing whitespace off the soon-to-be "keys": Pattern Demo

^(\d+) \+ (\d+) \K[a-z\d ]+(?=\s)  #244steps

PHP code: (Demo)

$txt='45205768 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
5203838 + 0 duplicates
44647359 + 0 mapped (98.76% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)';

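// \K resets the match start, so $out[0] holds only the key text;
// $out[1] and $out[2] hold the two counts for each line.
preg_match_all('/^(\d+) \+ (\d+) \K[a-z\d ]+(?=\s)/m',$txt,$out);
$result=[];
foreach($out[0] as $k=>$v){
    $result[$v]=[(int)$out[1][$k],(int)$out[2][$k]];  // re-casting strings as integers
}
var_export($result);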

Output:

array (
  'in total' => array (0 => 45205768, 1 => 0),
  'secondary' => array (0 => 0, 1 => 0),
  'supplementary' => array (0 => 0, 1 => 0),
  'duplicates' => array (0 => 5203838, 1 => 0),
  'mapped' => array (0 => 44647359, 1 => 0),
  'paired in sequencing' => array (0 => 0, 1 => 0),
  'read1' => array (0 => 0, 1 => 0),
  'read2' => array (0 => 0, 1 => 0),
  'properly paired' => array (0 => 0, 1 => 0),
  'with itself and mate mapped' => array (0 => 0, 1 => 0),
  'singletons' => array (0 => 0, 1 => 0),
  'with mate mapped to a different chr' => array (0 => 0, 1 => 0)
)

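Note that the last two lines of the input text produce a duplicate key in the $result array, which means the earlier line's data is overwritten by the later line's data. If that is a problem, you can restructure your input data or keep the parenthetical part as part of the key to make it unique.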
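One possible way to keep the parenthetical only when it is needed is sketched below. The rule "fall back to the full label when the short key is already taken" is an assumption about what uniqueness should mean here, and the pattern is a variation written for this sketch, not the one used above.

```php
<?php
// Sketch only: the collision rule below is an assumption, not part of the answer above.
$txt = "0 + 0 with mate mapped to a different chr\n"
     . "0 + 0 with mate mapped to a different chr (mapQ>=5)";

$result = [];
// Group 3 is the short label; optional group 4 is the " (...)" suffix.
preg_match_all('/^(\d+) \+ (\d+) ([a-z\d ]+?)( \(.*\))?$/m', $txt, $rows, PREG_SET_ORDER);
foreach ($rows as $row) {
    $key = $row[3];
    if (isset($result[$key])) {
        // Short key already used: append the parenthetical (if any) to disambiguate.
        $key .= $row[4] ?? '';
    }
    $result[$key] = [(int)$row[1], (int)$row[2]];
}

var_export($result);
```

With this rule, two identical labels that both lack a parenthetical would still collide, so it only helps for input shaped like the sample above.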