perl 正则表达式比 c++/boost 更快

Question

我为我的网站编写了一个 CGI 脚本，它可以读取文本块并匹配所有出现的英文单词。我最近对网站的代码进行了一些根本性的更改，因此需要用 C++ 重写其中的大部分代码。正如我所希望的那样，除了这个函数之外，C++ 中几乎所有的东西都比 perl 快得多。

我知道正则表达式是 C++ 中相对较新的补充，不一定是它的最强项。在这种情况下，可能只是因为它比 perl 慢。但我想分享我的代码，希望有人能够找到一种方法来加速我在 C++ 中所做的事情。

这是 perl 代码：

open(WORD, "</file/path/wordthree.txt") || die "opening";

while(<WORD>) {
chomp;
push @wordlist,$_;
}
close (WORD) || die "closing";

foreach (@wordlist) {
   while ($bloc =~ m/$_/g) {
      $location = pos($bloc) - length($_);
      $match=$location.";".pos($bloc).";".$_;
      push(@hits,$match);
   }
}

wordthree.txt 是一个由换行分隔的约 270,000 个英文单词的列表，$bloc 是 3200 个字符的文本。 Perl 在大约一秒钟内执行这些搜索。如果你喜欢，你可以在这里看到它在播放：http://libraryofbabel.info/anglishize.cgi?05y-w1-s3-v20:1

对于 C++，我尝试了以下方法：

typedef std::map<std::string::difference_type, std::string> hitmap;
hitmap hits;

void regres(const boost::match_results<std::string::const_iterator>& what) {
hits[what.position()]=what[0].str();
}

words.open ("/file/path/wordthree.txt");
std::string wordlist[274784];
unsigned i = 0;
while (words >> wordlist[i]) {i++;}
words.close();

for (unsigned i=0;i<274783;i++) {
    boost::regex word(wordlist[i]);
    boost::sregex_iterator lex(book.begin(),book.end(), word);
    boost::sregex_iterator end;
    std::for_each(lex, end, &regres);
}

C++ 版本读取相同数量的文本相同次数大约需要 12 秒。非常感谢任何关于如何使其与 perl 脚本竞争的建议。

Answer 1

在我看来，Perl 足够聪明，可以发现您在滥用正则表达式来进行普通的线性搜索，即直接查找。您正在查找纯文本，none 您的搜索模式似乎是一个模式。根据您的描述，您所有的搜索模式看起来都像普通字符串，因此 Perl 可能会将其优化为线性字符串搜索。

我不熟悉Boost的正则表达式匹配的内部实现，但很可能它正在将每个搜索字符串编译成一个状态机，然后为每个搜索执行状态机。这是与通用正则表达式实现一起使用的常用方法。这是很多工作。在这种特定情况下，有很多完全不必要的工作。

你应该做的是：

1) 您正在将 wordthree.txt 读入字符串数组。不要这样做，而是将其读入 std::set<std::string>。

2) 您正在阅读整个文本以搜索单个 book 容器。根据您的代码，尚不清楚 book 是单个 std::string 还是 std::vector<char>。但无论如何，不要那样做。阅读文本以迭代搜索，一次搜索一个词。对于每个单词，在 std::set 中查找，然后从那里开始。

毕竟，这是您要直接尝试做的事情，您应该这样做，而不是绕道而行，通过正则表达式的奇妙之处，正则表达式除了浪费大量时间外，收效甚微。

如果你正确地实现它，你可能会发现 C++ 与 Perl 一样快，如果不是更快的话。

我还可以想到其他几种更积极优化的方法，它们也利用 std::set，但使用自定义类和比较器，旨在避免使用 bunch 所固有的所有堆分配std::string，但可能没有必要。使用基于 std::set 的查找的基本方法应该足够快。

Answer 2

首先我会减少分配的数量：

尽可能使用 string_ref 而不是 std::string
使用映射文件而不是提前将其全部读取到内存中
使用 const char* 代替 std::string::const_iterator 浏览本书

这里有一个使用升灵气解析词表的示例（我没有你的，所以我假设是行分隔的词）。

std::vector<sref> wordlist;
io::mapped_file_source mapped("/etc/dictionaries-common/words");
qi::parse(mapped.begin(), mapped.end(), qi::raw[+(qi::char_ - qi::eol)] % qi::eol, wordlist);

完整 Live On Coliru¹

#include <boost/regex.hpp>
#include <boost/utility/string_ref.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/iostreams/device/mapped_file.hpp>

namespace qi = boost::spirit::qi;
namespace io = boost::iostreams;
using sref   = boost::string_ref;
using regex  = boost::regex;

namespace boost { namespace spirit { namespace traits {
    template <typename It>
    struct assign_to_attribute_from_iterators<sref, It, void> {
        static void call(It f, It l, sref& attr) { attr = { f, size_t(std::distance(f,l)) }; }
    };
} } }

typedef std::map<std::string::difference_type, sref> hitmap;
hitmap hits;

void regres(const boost::match_results<const char*>& what) {
    hits[what.position()] = sref(what[0].first, what[0].length());
}

int main() {
    std::vector<sref> wordlist;
    io::mapped_file_source mapped("/etc/dictionaries-common/words");
    qi::parse(mapped.begin(), mapped.end(), qi::raw[+(qi::char_ - qi::eol)] % qi::eol, wordlist);

    std::cout << "Wordlist contains " << wordlist.size() << " entries\n";

    io::mapped_file_source book("/etc/dictionaries-common/words");

    for (auto const& s: wordlist) {
        regex word(s.to_string());
        boost::cregex_iterator lex(book.begin(), book.end(), word), end;
        std::for_each(lex, end, &regres);
    }
}

下一步

这仍然会在每次迭代时创建一个正则表达式。我怀疑如果将它们全部组合成一个模式，效率会提高 lot。您将花费更多 memory/CPU 创建正则表达式，但您将通过单词列表中的条目数减少循环的能力。

因为正则表达式库可能不是为这种规模设计的，您可以通过自定义搜索和 trie 实现获得更好的结果。

这是一个简单的尝试（对于我的 99171 行的 /etc/dictionaries-common/words 文件来说确实要快得多）：

Live On Coliru

#include <boost/regex.hpp>
#include <boost/utility/string_ref.hpp>
#include <boost/iostreams/device/mapped_file.hpp>

namespace io = boost::iostreams;
using sref   = boost::string_ref;
using regex  = boost::regex;

typedef std::map<std::string::difference_type, sref> hitmap;
hitmap hits;

void regres(const boost::match_results<const char*>& what) {
    hits[what.position()] = sref(what[0].first, what[0].length());
}

int main() {
    io::mapped_file_params params("/etc/dictionaries-common/words");
    params.flags = io::mapped_file::mapmode::priv;
    io::mapped_file mapped(params);
    std::replace(mapped.data(), mapped.end(), '\n', '|');

    regex const wordlist(mapped.begin(), mapped.end() - 1);

    io::mapped_file_source book("/etc/dictionaries-common/words");

    boost::cregex_iterator lex(book.begin(), book.end(), wordlist), end;
    std::for_each(lex, end, &regres);
}

¹ 当然 coliru 没有合适的词表

perl 正则表达式比 c++/boost 更快

perl regex faster than c++/boost

c++

regex

optimization

perl

boost

下一步