Boost regular expression library: regex_search match error

Question

When I use the boost::regex_search function like this:

std::string sTest = "18";
std::string sRegex = "^([\u4E00-\u9FA5]+)(\d+)(#?)$";
std::string::const_iterator iterStart = sTest.begin();
std::string::const_iterator iterEnd = sTest.end();
boost::match_results<std::string::const_iterator> RegexResults;

while (boost::regex_search(iterStart, iterEnd, RegexResults, boost::regex(sRegex)))
{
    int a = 1;
    break; 
}

the value of sTest is reported as a match, although it should not be. When I use std::regex_search instead, there is no problem.

Answer 1

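Assuming the question is serious:

The regex matches

^ (start of input)
one or more characters from what you intended to be the "CJK Unified Ideographs" block (although only the part from the Unicode 1.0.1 standard).

However, that is not what gets parsed. (Instead, it is parsed as a regular hex escape, which happens to match 1.)

The docs tell me you probably wanted \x{dddd}, but that requires Unicode support.

Digging deeper, the docs tell me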

There are two ways to use Boost.Regex with Unicode strings:

Rely on wchar_t

(lists a bunch of limitations and conditions)

Use a Unicode Aware Regular Expression Type.

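I would suggest the latter.

one or more digits (according to the locale's character classification)
(#?) may have been intended as a comment ((?#)), but as spelled it optionally matches a single # character
followed by $ (end of input)

Your input contains no such ideographs, so it should not match.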
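
To make the intended behavior concrete, here is a minimal sketch of the wchar_t route. It rests on assumptions: it uses std::wregex from the standard library as a stand-in so that it builds without Boost or ICU, and the helper name matches_cjk_label is made up for illustration. In a wide literal the compiler turns \uXXXX into real code points before the regex engine sees them, so the bracket range should cover the ideographs and reject a digits-only input.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when the whole input is one or more ideographs,
// then one or more digits, then an optional '#'.
// Assumption: std::wregex stands in for a wide-character Boost regex.
bool matches_cjk_label(const std::wstring& input) {
    // In a wide, non-raw literal the compiler converts \u4E00 and \u9FA5 to
    // single wchar_t values; "\\d" reaches the regex engine as \d.
    static const std::wregex re(L"^([\u4E00-\u9FA5]+)(\\d+)(#?)$");
    return std::regex_match(input, re);
}
```

With this helper, matches_cjk_label(L"18") is expected to be false, while an input that starts with ideographs, such as L"\u4E2D\u658718", is expected to match. The wchar_t route still comes with the limitations the docs list, so treat this as an illustration rather than a recommendation.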

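Off-topic?

Also, since this is a fully anchored pattern (^$), it makes more sense to use regex_match rather than regex_search.

The while loop is an odd idea: the input never changes, so neither does the search result. If there is a match, the loop always breaks. Your while is equivalent to a more confusing if statement.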
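
The two remarks above can be sketched as a single anchored attempt inside a plain if. This is only an illustration under stated assumptions: it uses std::regex so that it compiles without Boost (boost::regex_match accepts the same string, match results and regex arguments), the helper name parse_label is made up, and [A-Za-z] is a placeholder for the ideograph range.

```cpp
#include <array>
#include <optional>
#include <regex>
#include <string>

// Hypothetical helper: returns the three capture groups on a full match,
// or std::nullopt otherwise.
// Assumption: [A-Za-z] is an ASCII placeholder for the ideograph range.
std::optional<std::array<std::string, 3>> parse_label(const std::string& input) {
    static const std::regex re(R"(^([A-Za-z]+)(\d+)(#?)$)");
    std::smatch m;
    // One anchored attempt over the whole input: an if, no loop needed.
    if (std::regex_match(input, m, re)) {
        return std::array<std::string, 3>{m[1].str(), m[2].str(), m[3].str()};
    }
    return std::nullopt;
}
```

Calling parse_label("ab18#") yields the groups "ab", "18" and "#", while parse_label("18") yields std::nullopt because the first group cannot be empty.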

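Fixes

Here is a program that implements three switchable approaches and tries three different regexes with each.

The approaches are BOOST_SIMPLE, BOOST_UNICODE and STANDARD_LIB.

The first regex is yours, the second uses \x{XXXX} escapes, and the third uses the named character class \p{InCJK_Unified_Ideographs}.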
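
The regexes in the listing are written as raw string literals, which affects what the regex engine actually receives. A small sketch (the constant names are made up for illustration): in a raw literal, \u4E00 stays six separate characters for the regex parser to interpret, whereas in an ordinary UTF-8 literal the compiler has already replaced it with the three-byte encoding of U+4E00.

```cpp
#include <cstddef>

// Raw literal: backslash, 'u', '4', 'E', '0', '0' reach the regex parser as-is.
constexpr std::size_t raw_length = sizeof(R"(\u4E00)") - 1;

// UTF-8 literal: the compiler encodes U+4E00 itself (bytes E4 B8 80).
constexpr std::size_t utf8_length = sizeof(u8"\u4E00") - 1;

static_assert(raw_length == 6, "raw literal keeps the escape as six characters");
static_assert(utf8_length == 3, "U+4E00 takes three bytes in UTF-8");
```

Which form is appropriate depends on whether the regex flavour in use interprets the escape itself, which is the difference the three test regexes probe.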

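The results are:

The Boost + Unicode approach works correctly (except with your problematic regex).
The standard library (on my GCC 10 installation) and Boost Simple both seem to accept the named character class. I doubt it is actually matched correctly, so I would guess it simply fails to match.

Listing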

Live On Coliru

#include <iostream>
#include <iomanip>
#include <string>
#include <string_view>

#ifdef STANDARD_LIB
  #include <regex>
  using match = std::smatch;
  using regex = std::regex;
  #define ctor regex
#elif defined(BOOST_SIMPLE)
  #include <boost/regex.hpp>
  using match = boost::smatch;
  using regex = boost::regex;
  #define ctor regex
#elif defined(BOOST_UNICODE)
  #include <boost/regex/icu.hpp>
  using match = boost::smatch;
  using regex = boost::u32regex;
  #define ctor        boost::make_u32regex
  #define regex_match boost::u32regex_match
#else
  #error "Need to pick a flavour"
#endif

int main() {
    std::string const sTest = "18";

    struct testcase { std::string_view label, re; };

    for (auto current : { testcase
        { "wrong unicode character class",
            R"(^([\u4E00-\u9FA5]+)(\d+)(#?)$)" },
        { "correct unicode character class",
            R"(^([\x{4E00}-\x{9FA5}]+)(\d+)(#?)$)" },
        { "named character class",
            R"(^([\p{InCJK_Unified_Ideographs}]+)(\d+)(#?)$)" },
    }) {
        std::cout
            << std::string(current.label.length(), '=') << "\n"
            << current.label << " (" << std::quoted(current.re) << ")\n";

        try {
            match results;
            bool is_match = regex_match(sTest, results, ctor(current.re.data()));

            std::cout << "is_match: " << std::boolalpha << is_match << "\n";
            if (is_match) {
                std::cout << ": " << std::quoted(results[1].str()) << "\n";
                std::cout << ": " << std::quoted(results[2].str()) << "\n";
                std::cout << ": " << std::quoted(results[3].str()) << "\n";
            }
        } catch(std::exception const& e) {
            std::cerr << "failure: " << e.what() << "\n";
        }
    }
}

Compiling

g++ -DBOOST_SIMPLE  -O2 -std=c++17 main.cpp -lboost_regex -o boost_simple
g++ -DBOOST_UNICODE -O2 -std=c++17 main.cpp -lboost_regex -o boost_unicode -licuuc
g++ -DSTANDARD_LIB  -O2 -std=c++17 main.cpp -o standard_lib

With output:

File boost_simple.log

 =============================
 wrong unicode character class ("^([\u4E00-\u9FA5]+)(\d+)(#?)$")
 is_match: true
 : "1"
 : "8"
 : ""
 ===============================
 correct unicode character class ("^([\x{4E00}-\x{9FA5}]+)(\d+)(#?)$")
 failure: Hexadecimal escape sequence was invalid.  The error occurred while parsing the regular expression fragment: '^([>>>HERE>>>\x{4E00}-\'.
 =====================
 named character class ("^([\p{InCJK_Unified_Ideographs}]+)(\d+)(#?)$")
 is_match: false

File boost_unicode.log

 =============================
 wrong unicode character class ("^([\u4E00-\u9FA5]+)(\d+)(#?)$")
 is_match: true
 : "1"
 : "8"
 : ""
 ===============================
 correct unicode character class ("^([\x{4E00}-\x{9FA5}]+)(\d+)(#?)$")
 is_match: false
 =====================
 named character class ("^([\p{InCJK_Unified_Ideographs}]+)(\d+)(#?)$")
 is_match: false

File standard_lib.log

 =============================
 wrong unicode character class ("^([\u4E00-\u9FA5]+)(\d+)(#?)$")
 failure: Invalid range in bracket expression.
 ===============================
 correct unicode character class ("^([\x{4E00}-\x{9FA5}]+)(\d+)(#?)$")
 failure: Unexpected end of regex when ascii character.
 =====================
 named character class ("^([\p{InCJK_Unified_Ideographs}]+)(\d+)(#?)$")
 is_match: false
