Remove all unexpected chars from Google Translate

Question

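I am using Google Translate to translate some text.

Sometimes Google Translate adds non-printable characters to the translated text.

For example, go to this page: https://www.google.com/search?client=ubuntu&channel=fs&q=traduttore&ie=utf-8&oe=utf-8

Select Italian to English and translate leone marino.

The result will be: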

sea lion
   ^ there are two more non-printable chars here, right before the "l" char

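You can verify this by pasting the text anywhere you can edit it (for example in a text editor, in a text field on any web page, or even in the browser URL bar) and moving through it with the keyboard arrow keys: you will notice that the cursor stops twice near the space character.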

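Leaving aside why these characters are inserted, how can I remove all of these non-printable characters using a regular expression in PHP and/or in Sublime Text?

Also, how can I see the Unicode representation of these characters?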

Answer 1

To remove all Unicode characters in the "Other, format" category (Cf), you can use

$s = preg_replace('~\p{Cf}+~u', '', $s);
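As a minimal sketch of how that pattern might be wrapped, assuming the input is meant to be UTF-8 (the function name `stripFormatChars` is hypothetical and not part of the original answer). `preg_replace` returns `null` on failure, for example on malformed UTF-8 with the `u` flag, so the sketch falls back to the unchanged string in that case:

```php
<?php
// Hypothetical helper: remove every character in Unicode category Cf
// (Other, format), which includes U+200B ZERO WIDTH SPACE.
function stripFormatChars(string $s): string
{
    $clean = preg_replace('~\p{Cf}+~u', '', $s);
    // preg_replace() returns null on error (e.g. invalid UTF-8 with the u flag),
    // so keep the original string rather than losing it.
    return $clean ?? $s;
}

// The two zero-width spaces are written as explicit escapes (PHP 7.0+).
$translated = "sea \u{200B}\u{200B}lion";
var_dump(strlen($translated));            // int(14): each U+200B is 3 bytes in UTF-8
var_dump(stripFormatChars($translated));  // string(8) "sea lion"
```

Keep in mind that Cf also covers characters such as U+200D ZERO WIDTH JOINER, U+00AD SOFT HYPHEN and the bidirectional marks, so removing the whole category can change emoji sequences or right-to-left text.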

Since you only want to remove a zero-width space (U+200B), you can just use

$s = str_replace("\u{200B}", "", $s);
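To check that the replacement actually found something, one option is the optional fourth argument of `str_replace`, which receives the number of replacements made. This is only an illustrative sketch; the variable names are made up for the example, and the `"\u{200B}"` escape syntax needs PHP 7.0 or later:

```php
<?php
// Sample string with two zero-width spaces written as explicit escapes.
$s = "sea \u{200B}\u{200B}lion";

// $count is filled in by reference with the number of replacements performed.
$clean = str_replace("\u{200B}", "", $s, $count);

echo $count, "\n";  // 2
echo $clean, "\n";  // sea lion
```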

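I use https://r12a.github.io/app-conversion/ (no affiliation) to check strings for hidden characters.

Here is possible PHP code that converts a string to its \uXXXX representation so you can quickly see the Unicode code points of the non-ASCII characters: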

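// The two zero-width spaces are written as explicit escapes so they are visible here
$input = "sea \u{200B}\u{200B}lion";
// Replace every character outside printable ASCII (space through ~) with its JSON \uXXXX escape
echo preg_replace_callback('#[^ -~]#u', function($m) {
    return substr(json_encode($m[0]), 1, -1); // strip the surrounding quotes
}, $input);
// => sea \u200b\u200blion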
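If the position of each hidden character matters as well, a rough alternative is to walk the string character by character and report the code points in U+XXXX notation. This sketch assumes the mbstring extension is available and PHP 7.4 or later (for `mb_str_split`); the function name `listNonAscii` is hypothetical:

```php
<?php
// Hypothetical helper: list every character outside printable ASCII
// together with its character offset and Unicode code point.
function listNonAscii(string $s): array
{
    $found = [];
    foreach (mb_str_split($s, 1, 'UTF-8') as $i => $ch) {
        $cp = mb_ord($ch, 'UTF-8');
        // Printable ASCII runs from 0x20 (space) to 0x7E (~).
        if ($cp < 0x20 || $cp > 0x7E) {
            $found[] = sprintf('%d: U+%04X', $i, $cp);
        }
    }
    return $found;
}

echo implode("\n", listNonAscii("sea \u{200B}\u{200B}lion")), "\n";
// 4: U+200B
// 5: U+200B
```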

