Regex to consolidate multiple rules

I'm looking at optimizing my string-manipulation code and consolidating all my replaceAll calls into a single pattern, if possible.

Rules -

My code -

public static String slugifyTitle(String value) {
    String slugifiedVal = null;
    if (StringUtils.isNotEmpty(value))
        slugifiedVal = value
                .replaceAll("[ ](?=[ ])|[^-A-Za-z0-9 ]+", "") // strips all special chars except -
                .replaceAll("\\s+", "-") // converts spaces to -
                .replaceAll("--+", "-"); // replaces consecutive -'s with just one -

    slugifiedVal = StringUtils.stripStart(slugifiedVal, "-"); // strips leading -
    slugifiedVal = StringUtils.stripEnd(slugifiedVal, "-"); // strips trailing -

    return slugifiedVal;
}

It does the job, but it obviously looks shoddy.

My test assertions -

Heading with symbols *~!@#$%^&()_+-=[]{};',.<>?/ ==> heading-with-symbols
    
Heading with an asterisk* ==> heading-with-an-asterisk
    
Custom-id-&-stuff ==> custom-id-stuff
    
--Custom-id-&-stuff-- ==> custom-id-stuff

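Consider the following regex parts:

  • Any special chars other than -: [\p{S}\p{P}&&[^-]]+ (character class subtraction)
  • Any one or more whitespace chars or hyphens: [\s-]+ (this will be used to replace them with a single -)
  • You still need to remove leading/trailing hyphens, which will be a separate post-processing step. If you wish, you can use the ^-+|-+$ regex for that.

So you can only reduce this to three .replaceAll calls while keeping the code precise and readable: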

public static String slugifyTitle(String value) {
    String slugifiedVal = null;
    if (value != null && !value.trim().isEmpty())
        slugifiedVal = value.toLowerCase()
                .replaceAll("[\\p{S}\\p{P}&&[^-]]+", "") // strips all special chars except -
                .replaceAll("[\\s-]+", "-") // converts spaces/hyphens to -
                .replaceAll("^-+|-+$", ""); // remove trailing/leading hyphens
    return slugifiedVal;
}

Java demo:

List<String> strs = Arrays.asList("Heading with symbols *~!@#$%^&()_+-=[]{};',.<>?/",
        "Heading with an asterisk*",
        "Custom-id-&-stuff",
        "--Custom-id-&-stuff--");
for (String str : strs) {
    System.out.println("\"" + str + "\" => " + slugifyTitle(str));
}

Output:

"Heading with symbols *~!@#$%^&()_+-=[]{};',.<>?/" => heading-with-symbols
"Heading with an asterisk*" => heading-with-an-asterisk
"Custom-id-&-stuff" => custom-id-stuff
"--Custom-id-&-stuff--" => custom-id-stuff

Note: if your strings can contain any Unicode whitespace, replace "[\\s-]+" with "(?U)[\\s-]+".
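To illustrate the note above, here is a minimal sketch of how the (?U) flag (UNICODE_CHARACTER_CLASS) changes what \s matches. The class name UnicodeWhitespaceDemo and the helper collapse are made up for this example and are not part of the answer's code; the sample title uses a no-break space (U+00A0) and an em space (U+2003).

```java
class UnicodeWhitespaceDemo {
    // Hypothetical helper: collapses runs of whitespace/hyphens into one hyphen,
    // optionally with the (?U) flag so that \s also covers Unicode whitespace.
    static String collapse(String s, boolean unicodeAware) {
        String pattern = unicodeAware ? "(?U)[\\s-]+" : "[\\s-]+";
        return s.replaceAll(pattern, "-");
    }

    public static void main(String[] args) {
        // U+00A0 (no-break space) and U+2003 (em space) are not matched by plain \s.
        String title = "Heading\u00A0with\u2003spaces";

        // Without (?U) the string comes back unchanged.
        System.out.println(collapse(title, false).equals(title)); // true

        // With (?U) both separators are replaced.
        System.out.println(collapse(title, true)); // Heading-with-spaces
    }
}
```

Whether this matters depends on where the titles come from; text pasted from web pages or word processors is the usual source of such characters.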

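Disclaimer: I don't think a regex approach to this problem is wrong, or that this is an objectively better approach. I am merely putting forward an alternative approach as food for thought.

I tend to argue against regex approaches to problems where you have to ask how to solve them with regex, because that suggests you will have a hard time maintaining the solution in the future. Regexes are opaque: "just do this" is obvious only once you already know to just do this.

Some problems that are typically solved with regex, like this one, can be solved with imperative code instead. It tends to be more verbose, but it uses simple, obvious code constructs; it is easier to debug; and it can be faster because it doesn't involve the full "machinery" of the regex engine.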


static String slugifyTitle(String value) {
    boolean appendHyphen = false;
    StringBuilder sb = new StringBuilder(value.length());

    // Go through value one character at a time...
    for (int i = 0; i < value.length(); i++) {
      char c = value.charAt(i);

      if (isAppendable(c)) {
        // We have found a character we want to include in the string.

        if (appendHyphen) {
          // We previously found character(s) that we want to append a single
          // hyphen for.
          sb.append('-');
          appendHyphen = false;
        }
        sb.append(c);
      } else if (requiresHyphen(c)) {
        // We want to replace hyphens or spaces with a single hyphen.
        // Only append a hyphen if it's not going to be the first thing in the output.
        // Doesn't matter if this is set for trailing hyphen/whitespace,
        // since we then never hit the "isAppendable" condition.
        appendHyphen = sb.length() > 0;
      } else {
        // Other characters are simply ignored.
      }
    }

    // You can lowercase when appending the character, but `Character.toLowerCase()`
    // recommends using `String.toLowerCase` instead.
    return sb.toString().toLowerCase(Locale.ROOT);
}

// Some predicate on characters you want to include in the output.
static boolean isAppendable(char c) {
  return (c >= 'A' && c <= 'Z')
      || (c >= 'a' && c <= 'z')
      || (c >= '0' && c <= '9');
}

// Some predicate on characters you want to replace with a single '-'.
static boolean requiresHyphen(char c) {
  return c == '-' || Character.isWhitespace(c);
}

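(This code is heavily over-commented for the purpose of explaining it in this answer. Strip out the comments and unnecessary things like the empty else, and it is actually not very complicated.)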
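As a rough illustration of that last point, here is a condensed sketch of the same loop with the comments and the empty else removed, wrapped in a class with a small main that runs the question's test inputs. The class name SlugSketch is made up for this example, the requiresHyphen predicate is inlined, and null input is assumed not to occur, as in the version above.

```java
import java.util.Locale;

class SlugSketch {
    static String slugifyTitle(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        boolean appendHyphen = false;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (isAppendable(c)) {
                // Emit at most one pending hyphen before the next kept character.
                if (appendHyphen) sb.append('-');
                appendHyphen = false;
                sb.append(c);
            } else if (c == '-' || Character.isWhitespace(c)) {
                // Never start the output with a hyphen.
                appendHyphen = sb.length() > 0;
            }
        }
        return sb.toString().toLowerCase(Locale.ROOT);
    }

    static boolean isAppendable(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
    }

    public static void main(String[] args) {
        String[] titles = {
            "Heading with symbols *~!@#$%^&()_+-=[]{};',.<>?/",
            "Heading with an asterisk*",
            "Custom-id-&-stuff",
            "--Custom-id-&-stuff--"
        };
        for (String title : titles) {
            System.out.println("\"" + title + "\" => " + slugifyTitle(title));
        }
    }
}
```

For these four inputs the sketch should give the same slugs as the regex version: heading-with-symbols, heading-with-an-asterisk, custom-id-stuff and custom-id-stuff.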