In Python 3, how do you remove all non-UTF8 characters from a string?

Question

I'm using Python 3.7. How do I remove all non-UTF-8 characters from a string? I tried using "lambda x: x.decode('utf-8','ignore').encode("utf-8")" in the code below:

coop_types = map(
    lambda x: x.decode('utf-8','ignore').encode("utf-8"),
    filter(None, set(d['type'] for d in input_file))
)

But this results in an error:

Traceback (most recent call last):
  File "scripts/parse_coop_csv.py", line 30, in <module>
    for coop_type in coop_types:
  File "scripts/parse_coop_csv.py", line 25, in <lambda>
    lambda x: x.decode('utf-8','ignore').encode("utf-8"),
AttributeError: 'str' object has no attribute 'decode'

If you have a generic way to remove all non-UTF-8 characters from a string, that's what I'm looking for.

Answer 1

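You are starting with a string. You can't decode a str (it is already decoded text; you can only encode it to binary data again). UTF-8 can encode almost all valid Unicode text (which is what str stores), so this shouldn't come up much, but if you are encountering surrogate characters in your input, you can reverse the direction, changing: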

x.decode('utf-8','ignore').encode("utf-8")

to:

x.encode('utf-8','ignore').decode("utf-8")

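This encodes everything that can be encoded as UTF-8, discards whatever cannot be encoded, and then decodes the now-clean UTF-8 bytes back into a str.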
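As a minimal sketch of how this could be applied to the code in the question: the helper name strip_unencodable and the sample rows are hypothetical, standing in for the real input_file, which isn't shown. A lone surrogate such as "\udc80" is one kind of character a str can hold that UTF-8 cannot encode.

```python
def strip_unencodable(text: str) -> str:
    """Drop characters that cannot be encoded as UTF-8 (e.g. lone surrogates)."""
    return text.encode("utf-8", "ignore").decode("utf-8")


# Hypothetical sample rows standing in for input_file from the question.
rows = [{"type": "credit\udc80 union"}, {"type": "café"}, {"type": ""}]

# Deduplicate, skip empty values, then clean each remaining string.
coop_types = sorted(
    strip_unencodable(t) for t in {d["type"] for d in rows} if t
)
print(coop_types)  # ['café', 'credit union']
```

Ordinary non-ASCII text such as "café" passes through unchanged; only the unencodable surrogate is removed.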
