What is the difference between Unicode code points and Unicode scalars?

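I see the two terms used (seemingly) interchangeably in many places. Are they the same thing or different? It also seems to depend on whether the language is talking about UTF-8 (e.g. Rust) or UTF-16 (e.g. Java/Haskell). Does the code point/scalar distinction somehow depend on the encoding scheme?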

First, let's look at definitions D9, D10 and D10a in Section 3.4, Characters and Encoding:

D9 Unicode codespace: A range of integers from 0 to 10FFFF₁₆.

D10 Code point: Any value in the Unicode codespace.

• A code point is also known as a code position.

...

D10a Code point type: Any of the seven fundamental classes of code points in the standard: Graphic, Format, Control, Private-Use, Surrogate, Noncharacter, Reserved.

[emphasis added]

OK, so code points are integers in a fixed range. They are divided into categories called "code point types".

Now let's look at definition D76 in Section 3.9, Unicode Encoding Forms:

D76 Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.

• As a result of this definition, the set of Unicode scalar values consists of the ranges 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆, inclusive.
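To make D9, D10 and D76 concrete, here is a minimal Python sketch of the three sets as predicates on plain integers. The function names are made up for illustration and are not part of any standard API; the ranges are taken directly from the definitions quoted above.

```python
# Illustrative helpers (hypothetical names), following D9, D10 and D76.
CODESPACE_MAX = 0x10FFFF  # D9: the codespace is 0..10FFFF (hex)


def is_code_point(n):
    """D10: any value in the Unicode codespace."""
    return 0 <= n <= CODESPACE_MAX


def is_surrogate(n):
    """High- and low-surrogate code points: the gap D800..DFFF implied by D76."""
    return 0xD800 <= n <= 0xDFFF


def is_scalar_value(n):
    """D76: any code point except the surrogates."""
    return is_code_point(n) and not is_surrogate(n)


# Sizes of the two sets, computed from the ranges in D76.
code_point_count = CODESPACE_MAX + 1
scalar_value_count = (0xD7FF - 0 + 1) + (0x10FFFF - 0xE000 + 1)

print(code_point_count, scalar_value_count)
```

Note that nothing here mentions UTF-8 or UTF-16: both sets are defined on integers, before any encoding is chosen.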

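Surrogates are defined and explained in Section 3.8, just before D76. The gist of the story is that surrogates fall into two categories, high surrogates and low surrogates. They are used only by UTF-16, so that it can represent every scalar value, including those above FFFF₁₆. (There are 1,114,112 code points, of which 1,112,064 are scalar values, but 2¹⁶ = 65,536 is far fewer than that.) UTF-8 doesn't have this problem; it is a variable-length encoding scheme (each scalar value takes 1 to 4 bytes), so it can accommodate all scalar values without using surrogates.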
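As a rough illustration of how UTF-16 uses a high and a low surrogate to reach beyond 16 bits, the sketch below splits one supplementary code point into a surrogate pair by hand and compares the result with Python's built-in codecs. The helper name `to_surrogate_pair` and the choice of U+1F600 as sample input are assumptions made for this example.

```python
def to_surrogate_pair(cp):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = cp - 0x10000             # a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits go into the low surrogate
    return high, low


cp = 0x1F600  # sample scalar value outside the 16-bit range
high, low = to_surrogate_pair(cp)

# Cross-check against the standard codecs.
utf16 = chr(cp).encode("utf-16-be")  # two 16-bit code units, big-endian
utf8 = chr(cp).encode("utf-8")       # four bytes, no surrogates involved

print(f"{high:04X} {low:04X}", utf16.hex(), utf8.hex())
```

The same scalar value ends up as two 16-bit code units in UTF-16 and as four bytes in UTF-8; only the UTF-16 form involves surrogates.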

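Summary: A code point is either a scalar value or a surrogate. A code point is just a number in the most abstract sense; how that number is encoded in binary form is a separate matter. UTF-16 uses surrogate pairs because a single 16-bit code unit cannot directly represent every scalar value. UTF-8 does not use surrogate pairs.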
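One way to see the distinction in practice, assuming CPython 3 behavior: a Python `str` is a sequence of code points, so it can hold a lone surrogate, but the standard UTF-8 and UTF-16 codecs only encode scalar values and reject it. The helper `can_encode` below is a hypothetical name used for this sketch.

```python
def can_encode(text, encoding):
    """Return True if `text` encodes cleanly with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


lone_surrogate = chr(0xD800)  # a code point, but not a scalar value
scalar = chr(0xE000)          # a private-use code point, which is a scalar value

for name, s in [("surrogate", lone_surrogate), ("scalar", scalar)]:
    print(name, can_encode(s, "utf-8"), can_encode(s, "utf-16"))

# Values outside the codespace are not code points at all.
try:
    chr(0x110000)
    outside_is_code_point = True
except ValueError:
    outside_is_code_point = False
```

The surrogate is rejected by both codecs, which matches the point above: whether something is a scalar value is decided before any encoding form is chosen.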

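In the future, you may find it helpful to consult the Unicode glossary. It contains many commonly used definitions, along with links to the definitions in the Unicode specification.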

unicode

terminology