Perl:匹配一个大数组

Perl: match against a large array

假设我有 $sentence = "The quick brown fox jumps over the lazy dog" 和一个数组 @targets = qw(brown red black lazy quick);.

目标是找出@targets中包含在$sentence中的所有单词,在本例中,就是这三个单词:brown lazy quick.

注意:@targets 数组可能非常大,其中可能有 10,000 个元素。并且可能有 10,000 $sentence 需要处理。

有没有一种高效优雅的方式来做到这一点?

这是一个使用 Regexp::Assemble 的例子:

use feature qw(say);
use strict;
use warnings;
use Regexp::Assemble;

my $str = "The quick brown fox jumps over the lazy dog";
my @targets = qw(brown lazy quick);
my $ra = Regexp::Assemble->new;
$ra->add($_) for @targets;
my $regex = $ra->re;
my @matches = $str =~ /($regex)/g;
say for @matches;  # optionally add List::Util::uniq if you want the list 
                     to be unique

输出:

quick
brown
lazy

看看这个逻辑对你有没有帮助:

#!/usr/bin/perl

use strict;
use warnings;

my $sentence = "The quick brown fox jumps over the lazy dog";

my @targets = qw/brown red black lazy quick/;
my %targets = map { $_ => 1 } @targets;

my @words = split(' ', $sentence);

foreach my $each_word ( @words ){
    print $each_word."\n" if $targets{$each_word};
}

输出:

quick
brown
lazy

使用 Algorithm::AhoCorasick::XS,它可以非常有效地搜索多个字符串:

#!/usr/bin/env perl
use strict;
use warnings;
use feature qw/say/;
use Algorithm::AhoCorasick::XS;

my @targets = qw(brown red black lazy quick);
my $matcher = Algorithm::AhoCorasick::XS->new(\@targets);

my @sentences = (
    "The quick brown fox jumps over the lazy dog",
    "The man in black fled across the desert and the gunslinger followed."
);


for my $sentence (@sentences) {
    my @matches = $matcher->matches($sentence); # Or unique_matches
    say "@matches" if @matches;
}

示例:

$ perl demo.pl
quick brown lazy
black

仅使用纯 perl

use strict;
use warnings;
use feature 'say';

my $sentence = "The quick brown fox jumps over the lazy dog";

my @targets   = qw/brown red black lazy quick/;
my $filter    = join('|',@targets);
my $re_filter = qr/\b($filter)\b/;

my @found = $sentence =~ /$re_filter/g;

say for @found;

输出

quick
brown
lazy

我对其他答案中介绍的方法进行了快速比较:

use feature qw(say);
use strict;
use warnings;
use Algorithm::AhoCorasick::XS;
use Benchmark qw(cmpthese);
use Regexp::Assemble;

{
    my @words = ("barbarous", "blandit", "celery", "gravida", "pencil", "move", "rain",
            "diligent", "letter", "staking", "enjoy", "salty", "abnormal", "volatile",
            "scared", "unsightly", "curabitor", "possible", "morbi", "open", "road");

    my $str = <<'END';
Lorem ipsum dolor sit amet consectetur adipiscing elit Vivamus tempor nulla condimentum ex posuere sed vulputate purus laoreet Etiam ut justo pulvinar lobortis leo in dictum ligula Pellentesque porta et justo a vestibulum Fusce et nulla vitae nisi scelerisque iaculis Donec non mi ligula Donec et ullamcorper eros Duis tincidunt ante dui vel vestibulum risus gravida molestie Nam eleifend placerat enim ultricies eleifend Pellentesque dapibus orci quis nulla faucibus ut rhoncus urna interdum Nam lobortis vitae nunc vel bibendum Nulla sed condimentum justo
Donec mauris felis fermentum eu libero sed cursus fermentum nunc Fusce dolor sapien ultrices nec egestas vel aliquam in mi Duis ullamcorper elit ut ornare tincidunt urna velit ultrices est at molestie urna risus faucibus libero Cras nec quam in orci efficitur tristique venenatis sit amet odio Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas Pellentesque pretium metus eu eros vulputate non ornare orci imperdiet Ut pharetra efficitur urna sit amet sodales Donec pharetra eros sit amet dolor ullamcorper aliquet Cras ac ante sit amet erat fermentum consequat eget a turpis Donec a tellus ex Sed suscipit posuere hendrerit Praesent suscipit ante et tortor tempor congue Lorem ipsum dolor sit amet consectetur adipiscing elit Nunc tincidunt erat sit amet venenatis finibus ante massa ornare lacus at euismod tortor neque at velit
Sed ultricies velit ante vel convallis massa accumsan a Curabitur mauris purus tempus eu leo vitae fringilla porttitor augue Pellentesque mattis sed turpis quis viverra Aenean a mauris nec urna sollicitudin bibendum Etiam elementum efficitur lacus nec finibus Morbi ut dolor iaculis porttitor magna quis posuere arcu Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia curae; Duis at luctus sem Suspendisse vel erat porttitor lobortis tortor ac suscipit nulla Cras maximus dapibus odio sit amet aliquet Curabitur iaculis odio at aliquet tincidunt Suspendisse potenti Vivamus maximus est id rhoncus consectetur nisi purus rutrum massa nec ullamcorper risus nisi ut metus
Phasellus lobortis mauris turpis ut sollicitudin felis ultrices vel Ut tristique felis gravida est varius rutrum Sed porttitor pharetra sapien ac gravida urna Curabitur quis dignissim turpis Phasellus turpis sapien scelerisque vitae turpis tincidunt faucibus tempus tortor Proin bibendum elementum libero fermentum rhoncus Ut scelerisque sem eget rutrum scelerisque ex felis scelerisque augue vitae elementum lorem enim sit amet erat Sed congue volutpat mi pellentesque dapibus mi tempor quis Mauris nibh lorem mattis in nibh sed lacinia sagittis urna Etiam posuere odio non nisi placerat fermentum Mauris elit massa fermentum quis hendrerit ut condimentum a risus
Cras ut leo rhoncus laoreet turpis quis consectetur sem Fusce gravida neque et mauris molestie fermentum Proin non vehicula tellus vel volutpat est Nullam quis ornare ligula Nunc sed quam orci Nullam enim dolor cursus interdum fermentum sit amet gravida eu ex Aliquam nec porta mi Ut varius ut dolor vel convallis Curabitur massa purus varius nec diam eget sollicitudin finibus nisi Mauris pharetra in odio in rutrum Vestibulum vel eleifend ex Mauris tellus ex tempor ut dolor luctus facilisis pharetra justo
Cras elementum enim a egestas bibendum sapien ligula convallis tellus in dignissim augue massa vel diam Fusce quis mi diam Phasellus mollis est non ultricies dictum Nullam ullamcorper leo quis lacinia scelerisque libero enim faucibus dui placerat sodales dolor lectus sed ligula Nunc interdum justo vitae faucibus faucibus neque dui cursus nunc ut gravida dui urna sed ipsum Ut luctus lacus nulla id imperdiet elit fermentum sit amet Mauris nulla erat gravida ac enim ut blandit blandit felis Ut feugiat consequat vulputate Mauris erat neque vehicula ac purus ut tristique fringilla nulla
END

    cmpthese(
        -1,
        {
            regexp_assemble => sub { regexp_assemble( $str, \@words ) },
            aho_corasick    => sub { aho_corasick( $str, \@words ) },
            pp_trie         => sub { pp_trie( $str, \@words ) },
            hash_lookup     => sub { hash_lookup( $str, \@words ) },
        }
    );
}

sub hash_lookup {
    my ( $str, $targets ) = @_;

    my %targets = map { $_ => 1 } @$targets;
    my @words = split ' ', $str;
    my @matches;
    foreach my $each_word ( @words ){
        push @matches, $each_word if $targets{$each_word};
    }
    return \@matches;
}

sub pp_trie {
    my ( $str, $targets ) = @_;

    my $filter    = join '|', map {quotemeta} @$targets;
    my $re_filter = qr/\b($filter)\b/;
    my @matches = $str =~ /$re_filter/g;
    return \@matches;
}

sub aho_corasick {
    my ( $str, $targets ) = @_;

    my $matcher = Algorithm::AhoCorasick::XS->new($targets);
    my @matches = $matcher->matches($str);
    return \@matches;
}

sub regexp_assemble {
    my ( $str, $targets ) = @_;
    my $ra = Regexp::Assemble->new;
    $ra->add($_) for @$targets;
    my $regex = $ra->re;
    my @matches = $str =~ /\b($regex)\b/g;
    return \@matches;
}

输出:

                   Rate regexp_assemble   hash_lookup       pp_trie aho_corasick
regexp_assemble  1935/s              --          -86%          -90%         -90%
hash_lookup     14221/s            635%            --          -24%         -28%
pp_trie         18618/s            862%           31%            --          -5%
aho_corasick    19651/s            916%           38%            6%           --

因此对于此设置,aho_corasick 是最快的。