preg_replace unicode characters
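I have several strings that contain unicode escape sequences. My task is to remove everything except the unicode escapes from these strings. For example, the following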
\ud83d\ude82 + \u2600\ufe0f = \ud83d\ude29\ud83d\ude29\ud83d\ude29\ud83d\ude29\ud83d\ude29
would become
\ud83d\ude82 \u2600\ufe0f \ud83d\ude29\ud83d\ude29\ud83d\ude29\ud83d\ude29\ud83d\ude29
Then I need to find repeated codes and separate them, so that:
\ud83d\ude29\ud83d\ude29\ud83d\ude29\ud83d\ude29\ud83d\ude29
becomes:
\ud83d\ude29 \ud83d\ude29 \ud83d\ude29 \ud83d\ude29 \ud83d\ude29
I have tried several preg_match solutions for the first part, but they either remove nothing from the string or remove everything. Here is my latest attempt:
/(^\\u[0-9a-f]{4})+/
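I'm not very familiar with regular expressions, and I'm starting to scratch my head because I'm not sure what else to try.
The end goal is to insert each unicode character into a database as its own record.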
You can do it in two steps:
$str = '\ud83d\ude82 + \u2600\ufe0f = \ud83d\ude29\ud83d\ude29\ud83d\ude29\ud83d\ude29\ud83d\ude29';
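// '\\\\' in a single-quoted PHP string reaches PCRE as \\, i.e. one literal backslash
// remove everything that is not part of a \uXXXX escape
$str = preg_replace('/(?<=\\\\u[a-f0-9]{4})[^\\\\]+/', '', $str);
// insert a space after a pair that is immediately followed by the same pair
$str = preg_replace('/((?:\\\\u[a-f0-9]{4}){2})(?=\1)/', '$1 ', $str);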
echo $str,"\n";
Output:
\ud83d\ude82\u2600\ufe0f\ud83d\ude29 \ud83d\ude29 \ud83d\ude29 \ud83d\ude29 \ud83d\ude29
Regex #1:
/ : regex delimiter
(?<= : start of positive lookbehind
\\u[a-f0-9]{4} : a literal \uXXXX escape (backslash, u, four hex digits)
) : end of lookbehind
[^\\]+ : one or more characters that are NOT a backslash
/ : regex delimiter
Regex #2:
/ : regex delimiter
( : start group 1
(?: : non-capturing group
\\u[a-f0-9]{4} : a \uXXXX escape
){2} : appears twice (two escapes, i.e. one pair)
) : end group 1
(?=\1) : lookahead, the content of group 1 is repeated immediately after
/ : regex delimiter
Replacement '$1 ' : group 1 followed by a space
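Since the end goal is one database record per character, it may be simpler to collect the escapes into an array rather than insert spaces. The following is only a sketch under some assumptions: the helper name splitEscapedEmoji is made up for illustration, and the pattern assumes each character is either a surrogate pair or a single \uXXXX escape, optionally followed by the variation selector \ufe0f. Longer sequences (for example ones joined with \u200d) are not handled.

```php
<?php
// Hypothetical helper, not part of the answer above: returns one array
// entry per escaped character so each can be stored as its own record.
function splitEscapedEmoji(string $str): array
{
    // Either a high surrogate (\ud800-\udbff) followed by a low surrogate
    // (\udc00-\udfff), or any single \uXXXX escape; each may be followed
    // by the variation selector \ufe0f. Everything else is skipped.
    $pattern = '/(?:\\\\ud[89ab][0-9a-f]{2}\\\\ud[c-f][0-9a-f]{2}|\\\\u[0-9a-f]{4})(?:\\\\ufe0f)?/i';
    preg_match_all($pattern, $str, $matches);
    return $matches[0];
}

$str = '\ud83d\ude82 + \u2600\ufe0f = \ud83d\ude29\ud83d\ude29';
$parts = splitEscapedEmoji($str);

// One entry per character; implode() gives the space-separated form.
echo implode(' ', $parts), "\n";
```

Each element of $parts can then be inserted as its own row, and repeated characters come out as separate entries without a second pass.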