Handle unicode while removing/replacing non-word characters

Question

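I want to remove punctuation and symbols (non-word characters in general) from strings that contain Unicode (non-ASCII) characters.

For example, New $Orléans should become New Orléans, or NewOrléans if spaces are removed as well.

The approaches I have come across so far use \W or \w (see PHP strip punctuation).

The challenge I'm facing is preserving Unicode. If I use \W, the Î in Île-de-France gets removed/replaced:

preg_replace('/\W+/', "-", 'Île-de-France') gives -le-de-France

Is it possible to remove non-word characters and still handle non-ASCII characters that are part of a word?

Thanks.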

Answer 1

If you need the safest regex for handling Unicode letters while removing non-word characters, use

'/[^\p{M}\w]+/u'

See the regex demo.
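As a rough sketch of how that pattern might be applied to the strings from the question: the helper name strip_non_word and its fallback behaviour are assumptions made for illustration, only the pattern itself comes from the answer.

```php
<?php
// Hypothetical helper; only the pattern is taken from the answer above.
function strip_non_word(string $text, string $replacement = ''): string
{
    // [^\p{M}\w]+ matches runs of characters that are neither combining
    // marks (\p{M}) nor word characters (\w). The /u modifier makes PCRE
    // treat both the pattern and the subject as UTF-8.
    $result = preg_replace('/[^\p{M}\w]+/u', $replacement, $text);

    // preg_replace() returns null on failure (e.g. invalid UTF-8 input);
    // in that case this sketch simply returns the input unchanged.
    return $result ?? $text;
}

echo strip_non_word('New $Orléans'), "\n";       // NewOrléans
echo strip_non_word('New $Orléans', ' '), "\n";  // New Orléans
echo strip_non_word('Île-de-France', '-'), "\n"; // Île-de-France
```

Passing a space as the replacement collapses the run " $" into a single space, which gives the "New Orléans" variant asked for in the question.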

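The key point is that you need the /u modifier in any case (so that the PCRE engine treats both the pattern and the string as Unicode strings), and that \w is not guaranteed to match combining marks, so \W on its own may strip them.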

If you don't need to worry about combining marks, you can use the '/\W+/u' regex to remove non-word characters.
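To see where the two patterns can differ, here is a small sketch using a decomposed "é" (the letter e followed by U+0301 COMBINING ACUTE ACCENT). The test string is made up for illustration. Whether \W matches the combining mark seems to depend on the PCRE version PHP was built against, so no output is claimed for the plain \W variant.

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT: a decomposed "é".
$decomposed = "Orle\u{0301}ans";

// With plain \W, the combining mark may or may not be treated as a
// non-word character, depending on the PCRE version in use.
$plain = preg_replace('/\W+/u', '', $decomposed);

// Excluding \p{M} from the negated class keeps the mark explicitly,
// regardless of how \w is defined.
$safe = preg_replace('/[^\p{M}\w]+/u', '', $decomposed);

var_dump($plain === $decomposed); // result depends on the PCRE version
var_dump($safe === $decomposed);  // bool(true)
```

If the first var_dump prints false on your setup, the accent was stripped and the \p{M} variant is the one to use.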

Also, see the /u modifier reference:

u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8.
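One practical consequence of the subject being treated as UTF-8 is that input that is not valid UTF-8 makes the call fail rather than return a partially processed string. A minimal sketch, with an invalid byte sequence chosen purely for illustration:

```php
<?php
// "\xC3\x28" is not a valid UTF-8 sequence: 0xC3 must be followed by a
// continuation byte, and 0x28 is not one.
$invalid = "abc\xC3\x28def";

$result = preg_replace('/\W+/u', '-', $invalid);

// With /u, preg_replace() returns null for a malformed subject and
// preg_last_error() reports PREG_BAD_UTF8_ERROR.
if ($result === null && preg_last_error() === PREG_BAD_UTF8_ERROR) {
    echo "Subject is not valid UTF-8; nothing was replaced.\n";
}
```

It may therefore be worth checking for a null result when the input encoding is not under your control.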

Tags: php, regex, string, unicode, preg-replace