PHP regex for checking whether a string comes from a single script, whatever that script is
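The M3AAWG (Messaging, Malware and Mobile Anti-Abuse Working Group) gives this guidance in "Best Practices for Unicode Abuse Prevention".
I want to enforce this check on a username field (PHP), so that it meets this condition: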
All characters in each identifier must be **from a single script** or from the combinations:
- Latin + Han + Hiragana + Katakana
- Latin + Han + Bopomofo
- Latin + Han + Hangul
The combinations part seems easy enough to code:
$nonLatinHanHiraganaKatakana = preg_match_all('/[^\p{Common}\p{Latin}\p{Han}\p{Hiragana}\p{Katakana}]/u', $value);
$nonLatinHanBopomofo = preg_match_all('/[^\p{Common}\p{Latin}\p{Han}\p{Bopomofo}]/u', $value);
$nonLatinHanHangul = preg_match_all('/[^\p{Common}\p{Latin}\p{Han}\p{Hangul}]/u', $value);
// If none of the allowed combinations by M3AAWG, then reject
if ($nonLatinHanHiraganaKatakana && $nonLatinHanBopomofo && $nonLatinHanHangul)
{
// REJECT
}
Is there an easy way to check whether a string comes from a single script (meaning only {latin}, only {greek}, etc., whatever it is)?
The trivial approach seems to be checking each script one by one, but there are hundreds of them, which looks impractical and poor for performance. Maybe there is a way to get the script of each character in a given username, and then check whether there is more than one?
Test the string against "every" possible script, i.e.:
<?php
echo detectChars("例例")."\n";
echo detectChars("Χαίρετε")."\n";
echo detectChars("user345")."\n";

// Returns the name of the one script that every character of $username
// belongs to, or "NOT ALLOWED" if no single script covers the whole string.
function detectChars($username){
    // Script names as accepted by PCRE's \p{...}; extend the list as needed.
    $scripts = array(
        'Common', 'Arabic', 'Armenian', 'Bengali', 'Bopomofo', 'Braille',
        'Buhid', 'Canadian_Aboriginal', 'Cherokee', 'Cyrillic', 'Devanagari',
        'Ethiopic', 'Georgian', 'Greek', 'Gujarati', 'Gurmukhi', 'Han',
        'Hangul', 'Hanunoo', 'Hebrew', 'Hiragana', 'Inherited', 'Kannada',
        'Katakana', 'Khmer', 'Lao', 'Latin', 'Limbu', 'Malayalam',
        'Mongolian', 'Myanmar', 'Ogham', 'Oriya', 'Runic', 'Sinhala',
        'Syriac', 'Tagalog', 'Tagbanwa', 'Tamil', 'Telugu', 'Thaana',
        'Thai', 'Tibetan', 'Yi',
    );
    foreach ($scripts as $script) {
        // The whole string must consist of characters from this one script.
        if (preg_match('/^\p{' . $script . '}+$/u', $username)) {
            return $script;
        }
    }
    return "NOT ALLOWED";
}
Output:
Han
Greek
NOT ALLOWED
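The last line of the output is worth a note: "user345" is rejected because digits belong to the Common script, so no single script covers the whole string. Below is a minimal sketch of a variant that tolerates Common and Inherited characters (digits, punctuation, combining marks) next to exactly one real script. The function name `detectSingleScript` and the shortened script list are assumptions made for illustration, not part of the answer above.

```php
<?php
// Hypothetical helper: returns the one script used by $username, or null.
// Common and Inherited characters are allowed next to that script.
function detectSingleScript(string $username, array $scripts): ?string
{
    foreach ($scripts as $script) {
        // The lookahead requires at least one character of the script itself;
        // the class then allows that script plus Common/Inherited only.
        $pattern = '/^(?=.*\p{' . $script . '})'
                 . '[\p{' . $script . '}\p{Common}\p{Inherited}]+$/us';
        if (preg_match($pattern, $username) === 1) {
            return $script;
        }
    }
    return null;
}

// Shortened list for the example; extend it with the scripts you accept.
$scripts = ['Arabic', 'Armenian', 'Cyrillic', 'Greek', 'Han', 'Hangul',
            'Hebrew', 'Hiragana', 'Katakana', 'Latin', 'Thai'];

echo detectSingleScript('例例', $scripts) ?? 'NOT ALLOWED', "\n";          // Han
echo detectSingleScript('Χαίρετε', $scripts) ?? 'NOT ALLOWED', "\n";      // Greek
echo detectSingleScript('user345', $scripts) ?? 'NOT ALLOWED', "\n";      // Latin
// Latin "u", "e", "r" mixed with Cyrillic U+0455, which looks like "s":
echo detectSingleScript("u\u{0455}er", $scripts) ?? 'NOT ALLOWED', "\n";  // NOT ALLOWED
```

Whether Common characters should be accepted at all is a policy decision; the question's own patterns include `\p{Common}`, which is why this sketch does too.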
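Surely they are not just a bunch of hacks. If they are the standard bearers, let them provide a list of allowed combinations that you or anyone else can use to construct a regex. And there is nothing wrong with doing it all in a single regex.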
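^(?:[\p{Latin}\p{Han}\p{Hiragana}\p{Katakana}]+|[\p{Latin}\p{Han}\p{Bopomofo}]+|[\p{Latin}\p{Han}\p{Hangul}]+)$
Expanded (needs the x modifier for the whitespace and # comments)
^
(?:    # plain group: an atomic (?>...) would never retry the later alternatives
    [\p{Latin}\p{Han}\p{Hiragana}\p{Katakana}]+
  |
    [\p{Latin}\p{Han}\p{Bopomofo}]+
  |
    [\p{Latin}\p{Han}\p{Hangul}]+
    # Add as many as you need
    # | \p{..}
)
$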
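For use from PHP, the pattern above could be wrapped roughly as follows. This is a sketch under assumptions: the function name `matchesAllowedCombination` is hypothetical, the `u` modifier is added so `\p{...}` works on UTF-8 input, and the `x` modifier is added so the expanded layout is legal. The sketch uses a plain non-capturing group `(?:...)` rather than an atomic `(?>...)`: once an atomic group has matched a prefix with its first alternative, the engine cannot go back and try the other alternatives, so a name such as "abc한" would be rejected.

```php
<?php
// Hypothetical helper: true if $value fits one of the three allowed combinations.
function matchesAllowedCombination(string $value): bool
{
    $pattern = '/
        ^
        (?:
            [\p{Latin}\p{Han}\p{Hiragana}\p{Katakana}]+
          |
            [\p{Latin}\p{Han}\p{Bopomofo}]+
          |
            [\p{Latin}\p{Han}\p{Hangul}]+
        )
        $
    /xu';

    return preg_match($pattern, $value) === 1;
}

var_dump(matchesAllowedCombination('abc한'));     // Latin + Hangul
var_dump(matchesAllowedCombination('漢字かな'));   // Han + Hiragana
var_dump(matchesAllowedCombination('かな한'));     // Hiragana + Hangul, not an allowed mix
```

As written, digits and punctuation are rejected because they belong to the Common script; adding `\p{Common}` to each character class, as the question's patterns do, would accept them.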