Why do I get the first capture group only?

Question

(… and … did not help me.)

While analyzing a problem with /proc/stat on Linux I started writing a small utility, but I can't get the capture groups the way I want them. Here is the code:

#!/usr/bin/perl
use strict;
use warnings;

if (open(my $fh, '<', my $file = '/proc/stat')) {
    while (<$fh>) {
        if (my ($cpu, @vals) = /^cpu(\d*)(?:\s+(\d+))+$/) {
            print "$cpu $#vals\n";
        }
    }
    close($fh);
} else {
    die "$file: $!\n";
}

For example, with these input lines:

> cat /proc/stat
cpu  2709779 13999 551920 11622773 135610 0 194680 0 0 0
cpu0 677679 3082 124900 11507188 134042 0 164081 0 0 0
cpu1 775182 3866 147044 38910 135 0 15026 0 0 0
cpu2 704411 3024 143057 37674 1272 0 8403 0 0 0
cpu3 552506 4025 136918 38999 160 0 7169 0 0 0
intr 176332106  ...

So the match itself works, but @vals does not receive all the captured values (Perl 5.18.2 and 5.26.1).

Answer 1

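Perl's regex engine remembers only the last capture of a repeated group. If you want each number in its own capture group, one option is to spell the pattern out explicitly: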

if (open(my $fh, '<', my $file = '/proc/stat')) {
    while (<$fh>) {
        if (my ($cpu, @vals) = /^cpu(\d*)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)$/) {
            print "$cpu $#vals\n";
        }
    }
    close($fh);
} else {
    die "$file: $!\n";
}
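
To see that behaviour in isolation, here is a minimal sketch. The sample line and the helper name captures_for are made up for illustration. A quantified group still counts as a single capture group, so a list-context match returns exactly two values no matter how many numbers the line holds.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return everything the question's pattern captures
# for one line (called in list context).
sub captures_for {
    my ($line) = @_;
    return $line =~ /^cpu(\d*)(?:\s+(\d+))+$/;
}

my @got = captures_for("cpu0 11 22 33");

# Two capture groups in the pattern, so two values come back;
# the second one holds only the final repetition.
print scalar(@got), " captures: @got\n";    # 2 captures: 0 33
```

This is why $#vals is always 0 in the question: @vals receives one element, the last number on the line.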

Answer 2

Going by the sample input, the following inside the while loop should work.

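if (/^cpu(\d*)/) {
    my $cpu = $1;
    # No trailing + here: with /g each match must cover one number only,
    # otherwise a single match swallows the whole line and keeps just the last.
    my (@vals) = /\s+(\d+)/g;
    print "$cpu $#vals\n";
}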

Answer 3

Just adding to an earlier answer:

You can capture multiple values with a single group (using the g modifier), but you have to split the statement in two.

    if (my ($cpu) = /^cpu(\d*)(?:\s+(\d+))+$/) {
        my @vals= /(?:\s+(\d+))/g;
        print "$cpu $#vals\n";
    }
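
As a variation on the same two-step idea, the second match can continue from where the first one stopped instead of rescanning the line. This is only a sketch: the helper name parse_cpu_line is hypothetical, and it relies on /g in scalar context recording the match position so that \G can pick up from it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns (cpu, values...) or an empty list for other lines.
sub parse_cpu_line {
    my ($line) = @_;

    # Scalar-context /g remembers where the match ended (pos($line)).
    return unless $line =~ /^cpu(\d*)/g;
    my $cpu = $1;

    my @vals;
    # \G anchors each attempt at the previous end position;
    # /c keeps that position when the final attempt fails.
    while ($line =~ /\G\s+(\d+)/gc) {
        push @vals, $1;
    }
    return ($cpu, @vals);
}

my ($cpu, @vals) = parse_cpu_line("cpu2 704411 3024 143057\n");
print "$cpu $#vals\n";    # 2 2
```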

Answer 4

Replacing

    while (<$fh>) {
        if (my ($cpu, @vals) = /^cpu(\d*)(?:\s+(\d+))+$/) {

with

    while (<$fh>) {
        my @vals;
        if (my ($cpu) = /^cpu(\d*)(?:\s+(\d+)(?{ push(@vals, $^N) }))+$/) {

does what I want (requires Perl 5.8 or newer).

Answer 5

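Within a single pattern, only the last of a group's repeated matches is captured.

Instead, you can split the line, then check and adjust the first field: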

while (<$fh>) {
    my ($cpu, @vals) = split;
    next if not $cpu =~ s/^cpu//;
    print "$cpu $#vals\n";
}

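If the first element returned by split does not start with cpu, the substitution fails and the line is skipped. Otherwise $cpu is left holding the number that follows cpu (or an empty string), as in the OP.†

Alternatively, you can exploit the specific structure of the lines you process: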

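while (<$fh>) {
    # split on an empty string returns an empty list, so pass an empty
    # capture through unchanged to keep $cpu from taking the first number
    if (my ($cpu, @vals) = map { length($_) ? split(' ', $_) : '' } /^cpu([0-9]*) \s+ (.*)/x) {
        print "$cpu $#vals\n";
    }
}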

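The regex returns two items, and each goes through the split in the map. The first is effectively passed as-is to $cpu (either a number or an empty string), while the second yields the numbers.

Both of these produce the desired output in my tests.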

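† Since we always check for ^cpu (and remove it), it would make sense to do that first and split only when needed. However, that is a bit tricky, for the following reason.

A bare split strips leading (and trailing) whitespace by default, so for the line whose cpu string has no trailing digits (cpu 2709779...) the next number would be taken as the cpu name! A quiet bug.

So we need to give split an explicit /\s+/ pattern, since that treats the leading whitespace as a separator and returns an empty first field: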

while (<$fh>) {
    next if not s/^cpu//;
    my ($cpu, @vals) = split /\s+/;  # now $cpu may be an empty string
    print "$cpu $#vals\n";
}

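This now works as intended, because the cpu without trailing digits gets an empty string, a case that needs handling but is clear. It is misleading, though: an unaware maintainer, or the proverbial us six months from now, may be tempted to remove the seemingly unneeded /\s+/ and thereby introduce the bug.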
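
The difference between the two forms of split is easy to check on its own. The sketch below uses a shortened version of the aggregate line from the question, as it looks after s/^cpu// has removed the prefix. The variable names are only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# What remains of the aggregate line once "cpu" has been stripped.
my $rest = "  2709779 13999 551920";

my @bare    = split ' ', $rest;    # what a bare split does: leading whitespace is skipped
my @pattern = split /\s+/, $rest;  # leading whitespace is a separator, so field 0 is empty

print "bare:    first field is '$bare[0]', ", scalar(@bare), " fields\n";
print "pattern: first field is '$pattern[0]', ", scalar(@pattern), " fields\n";
# bare:    first field is '2709779', 3 fields
# pattern: first field is '', 4 fields
```

With the bare form the first number silently becomes the cpu name. With /\s+/ the name is an empty string and the numbers stay in place.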

Answer 6

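In Learning Perl, we state a problem that's easy to solve with two simple regexes but hard with one (but then in the exercises for Mastering Perl, I pull out the big guns). We don't tell people this because we want to highlight the natural tendency to try to write everything in a single regex. Some of the contortions in the other answers remind me of that, and I wouldn't want to maintain any of them.

First, there's the problem of processing only the interesting lines. Then, once we have such a line, we grab all the numbers. Translating that problem statement into code is simple and straightforward. There are no acrobatics here because the assertions and anchors do most of the work: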

use v5.14;  # say needs 5.10; the /a flag needs 5.14

while( <DATA> ) {
    next unless /\A cpu(\d*) \s /ax;
    my $cpu = $1;
    my @values = / \b (\d+) \b /agx;
    say "$cpu " . @values;
    }

__END__
cpu  2709779 13999 551920 11622773 135610 0 194680 0 0 0
cpu0 677679 3082 124900 11507188 134042 0 164081 0 0 0
cpu1 775182 3866 147044 38910 135 0 15026 0 0 0
cpu2 704411 3024 143057 37674 1272 0 8403 0 0 0
cpu3 552506 4025 136918 38999 160 0 7169 0 0 0
intr 176332106  ...

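Note that the OP still has to decide how to handle the case where cpu has no trailing digits. I don't know what you want to do with the empty string.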
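
One possible way to settle that, offered only as a sketch: give the aggregate line a label of its own. The helper name parse_stat_line and the label 'all' are arbitrary choices made here for illustration, not something the question asks for.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns (label, values...) for a cpu line,
# or an empty list for any other line.
sub parse_stat_line {
    my ($line) = @_;
    return unless $line =~ /\A cpu(\d*) \s /x;

    # An empty capture means the aggregate "cpu" line; 'all' is an assumed label.
    my $label  = length($1) ? $1 : 'all';
    my @values = $line =~ / \b (\d+) \b /gx;
    return ($label, @values);
}

my ($label, @values) = parse_stat_line("cpu  2709779 13999 551920\n");
print "$label has ", scalar(@values), " values\n";    # all has 3 values
```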

Answer 7

Here is my example. I thought I'd add it because I like simple code. It also allows for a "cpu7" with no trailing numbers.

#!/usr/bin/perl
use strict;
use warnings;

my $file = "/proc/stat";
open(my $fh, "<", $file) or die "$file: $!\n";
while (<$fh>) 
{
  if ( /^cpu(\d+)(\s+)?(.*)$/ ) 
  {
    my $cpu = $1;
    my $vals = scalar split( /\s+/, $3 );  # number of fields after the cpu name
    print "$cpu $vals\n";
  }
}
close($fh);

regex

perl

regex-group