PHP: How to match a range of unicode paired surrogates emoticons/emoji?
anubhava's answer about matching ranges of unicode characters led me to the regex to use for cleaning up a specific range of single code point characters. With it, I can now match all miscellaneous symbols in this list (emoticons included) with this simple expression:
preg_replace('/[\x{2600}-\x{26FF}]/u', '', $str);
However, I also want to match this list of paired/double surrogates emoji, but as nhahtdh explained in a comment:
There is a range from d800 to dfff to specify surrogates in UTF-16 to allow for more characters to be specified. A single surrogate is not a valid character in UTF-16 (a pair is necessary to specify a valid character).
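As a rough illustration of the quoted comment, the arithmetic that turns a UTF-16 surrogate pair into the single code point it stands for can be sketched like this. The function name surrogate_pair_to_code_point is a hypothetical helper, not something from the original post or from PHP itself.

```php
<?php
// Hypothetical helper (not part of PHP): combine a UTF-16 surrogate pair
// into the single code point it encodes.
function surrogate_pair_to_code_point(int $high, int $low): int
{
    // High surrogates span D800-DBFF, low surrogates span DC00-DFFF.
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

$codePoint = surrogate_pair_to_code_point(0xD83D, 0xDE00);
echo strtoupper(dechex($codePoint)), "\n"; // 1F600
```

So the pair D83D DE00 is only the UTF-16 spelling of U+1F600; in a UTF-8 subject there is one code point, not two.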
So, for example, when I try this:
preg_replace('/\x{D83D}\x{DE00}/u', '', $str);
to replace only the first of the paired surrogates on this list, i.e. 😀 (U+1F600), PHP throws this:
preg_replace(): Compilation failed: disallowed Unicode code point (>= 0xd800 && <= 0xdfff)
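I tried several different combinations, including the supposed UTF-8 byte sequence for the code point above ('/[\x{00F0}\x{009F}\x{0098}\x{0080}]/u'), but I was still unable to match it. I also looked into other PCRE pattern modifiers, but u seems to be the only one that enables UTF-8 handling.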
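A likely reason the byte-style attempt fails, sketched under the assumption that the subject string is valid UTF-8: with the u modifier each \x{..} names a code point rather than a byte, so the character class lists four unrelated characters. The variable names below are illustrative only.

```php
<?php
// The four UTF-8 bytes of U+1F600, written as byte escapes.
$emoji = "\xF0\x9F\x98\x80";
echo bin2hex($emoji), "\n"; // f09f9880

// With /u, each \x{..} is a code point: this class matches U+00F0, U+009F,
// U+0098 or U+0080, none of which is the emoji.
$asCodePoints = preg_match('/[\x{00F0}\x{009F}\x{0098}\x{0080}]/u', $emoji);

// Without /u, pattern and subject are treated as bytes, so the sequence matches.
$asBytes = preg_match('/\xF0\x9F\x98\x80/', $emoji);

var_dump($asCodePoints, $asBytes); // int(0) and int(1)
```

Byte-wise matching without u does find the emoji, but a byte-wise character class cannot express a range of multi-byte characters, so it does not replace a code point based pattern.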
Am I missing any "escape" alternative here?
This was very helpful in finding a solution:
If your PHP isn't shipped with a PCRE build for UTF-16 then you can't perform such a match. From PHP 7.0 on, you're able to use Unicode code points following this syntax \u{XXXX} e.g. preg_replace("~\u{1F600}~", '', $str); (Mind the double quotes)
Since I am using PHP 7, echo "\u{1F602}"; outputs the corresponding character (😂), according to this PHP RFC page on unicode escape. The proposal is essentially:
A new escape sequence is added for double-quoted strings and heredocs.
\u{ codepoint-digits }
where codepoint-digits is composed of hexadecimal digits.
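A small sketch of what "mind the double quotes" means in practice, assuming PHP 7.0 or later: the escape is resolved by PHP only inside double-quoted strings (and heredocs), while a single-quoted string keeps the backslash sequence literally. The variable names are illustrative.

```php
<?php
$double = "\u{1F602}"; // interpreted by PHP 7+: one code point, four UTF-8 bytes
$single = '\u{1F602}'; // not interpreted: nine literal ASCII characters

echo strlen($double), "\n";  // 4
echo strlen($single), "\n";  // 9
echo bin2hex($double), "\n"; // f09f9882
```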
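This means the match string in preg_replace (normally single-quoted, so as not to interfere with double-quoted string variable expansion) now needs some preg_quote magic. Here is the solution I came up with: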
preg_replace(
    // Single code point range (Miscellaneous Symbols):
    // http://www.fileformat.info/info/unicode/block/miscellaneous_symbols/list.htm
    "/[\x{2600}-\x{26FF}" .
    // Concatenated with the paired-surrogate range (Emoticons):
    // https://www.fileformat.info/info/unicode/block/emoticons/list.htm
    preg_quote("\u{1F600}", '/') . "-" . preg_quote("\u{1F64F}", '/') .
    "]/u",
    '',
    $str
);
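Here's proof of the above in 3v4l.
Edit: a simpler solution
It appears that you can put the unicode characters directly into the regex character class, which works with single-quoted strings and earlier PHP versions (e.g. 4.3.4):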
preg_replace('/[☀-⛿😀-🙏]/u','YOINK',$str);
To use PHP 7's new feature though, you still need double quotes:
preg_replace("/[\u{2600}-\u{26FF}\u{1F600}-\u{1F64F}]/u",'YOINK',$str);
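For reuse, the double-quoted pattern above could be wrapped in a small function. This is only a sketch assuming PHP 7.0 or later and UTF-8 input; the name strip_symbols_and_emoticons and the fallback on failure are illustrative choices, not part of the original solution.

```php
<?php
// Hypothetical helper: replaces Miscellaneous Symbols (U+2600-U+26FF) and
// Emoticons (U+1F600-U+1F64F) in a UTF-8 string.
function strip_symbols_and_emoticons(string $str, string $replacement = ''): string
{
    // Double quotes so PHP expands \u{...} before PCRE sees the pattern.
    $pattern = "/[\u{2600}-\u{26FF}\u{1F600}-\u{1F64F}]/u";
    $result = preg_replace($pattern, $replacement, $str);

    // preg_replace() returns null on error (for example invalid UTF-8 input);
    // in that case this sketch hands the input back unchanged.
    return $result ?? $str;
}

echo strip_symbols_and_emoticons("sun \u{2600} grin \u{1F600} done"), "\n";
echo strip_symbols_and_emoticons("\u{1F64F} thanks", 'YOINK'), "\n"; // YOINK thanks
```

PCRE's own \x{1F600} escape under the u modifier may also accept these code points directly in a single-quoted pattern, since the error shown earlier concerns only the surrogate range d800 to dfff; that is worth verifying on your own PHP build.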