UTF-8, Unicode, and how does a machine interpret bytes?
Here is what I understand:
- A Unicode character can be represented as a sequence of up to 4 bytes. So, if a character is represented in two or more bytes, byte ordering matters with respect to big-endian or little-endian machines (BEM or LEM).
- UTF-8 writes bytes into a file/network stream byte by byte (no multibyte writing or reading). That means that if a character is represented in two or more bytes, encoding writes them one byte at a time. So BEM or LEM does not matter: decoding always reads the bytes back correctly, and nothing swaps them when writing or reading.
- UTF-16 or UTF-32 always use two or four bytes while encoding, so LEM or BEM now really matters because of multibyte reading/writing.
- In addition, I understand how UTF-8 knows to interpret bytes as a character while reading from a file (decoding).
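A minimal sketch of the byte-order point in the list above. The helpers `encode_utf8` and `encode_utf16_bmp` are hypothetical names made up for this illustration, not standard library functions. The sketch assumes a valid code point, and the UTF-16 helper only covers code points that fit in a single 16-bit code unit (UTF-16 needs two code units, i.e. four bytes, for anything above U+FFFF).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical helper: encode one Unicode code point as UTF-8.
// The result is a fixed sequence of bytes; no byte order is involved.
Bytes encode_utf8(char32_t cp) {
    // Valid scalar values only: up to U+10FFFF, excluding surrogates.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    auto byte = [](char32_t v) { return static_cast<std::uint8_t>(v); };
    if (cp < 0x80) {
        return {byte(cp)};
    }
    if (cp < 0x800) {
        return {byte(0xC0 | (cp >> 6)), byte(0x80 | (cp & 0x3F))};
    }
    if (cp < 0x10000) {
        return {byte(0xE0 | (cp >> 12)),
                byte(0x80 | ((cp >> 6) & 0x3F)),
                byte(0x80 | (cp & 0x3F))};
    }
    return {byte(0xF0 | (cp >> 18)),
            byte(0x80 | ((cp >> 12) & 0x3F)),
            byte(0x80 | ((cp >> 6) & 0x3F)),
            byte(0x80 | (cp & 0x3F))};
}

// Hypothetical helper: serialize the single UTF-16 code unit of a code point
// in the Basic Multilingual Plane. Here the writer has to pick a byte order.
Bytes encode_utf16_bmp(char32_t cp, bool big_endian) {
    assert(cp < 0x10000 && !(cp >= 0xD800 && cp <= 0xDFFF));
    const auto hi = static_cast<std::uint8_t>(cp >> 8);
    const auto lo = static_cast<std::uint8_t>(cp & 0xFF);
    if (big_endian) {
        return {hi, lo};
    }
    return {lo, hi};
}
```

For Ф (U+0424), `encode_utf8` gives the bytes D0 A4 regardless of the machine, while `encode_utf16_bmp` gives 04 24 or 24 04 depending on the byte order chosen, which is why UTF-16 data needs a BOM or an agreed byte order and UTF-8 does not.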
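So, here is the example:
I declare and initialize a String variable in C++ as "ANФГ".
Questions:
- In C++, `char` is a one-byte character data type. Is the String class based on `char[]` in C++?
- Can I declare a string variable like this? Is UTF-8 encoding the default?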
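- I decide to write this string to a file. The string should be represented as A - one byte, N - one byte, Ф - a two-byte sequence, Г - a two-byte sequence. How will it be stored in the String and in the file? What are the addresses of these 6 bytes?
- How is it read back from the file with regard to BEM and LEM? Does C++ know the address order in which these bytes are stored in memory?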
EDIT_1: I don't understand one thing. If I have three bytes:
- 1000 1111
- 1100 0000
- 0100 0000
The first and the second represent one character in UTF-8, and the third represents one as well.
The bytes are in the order I wrote above.
Every byte has its own address, right?
But when multibyte writing happens, are two bytes stored in one place? I mean, does any output stream write data in left-to-right order, and is it then read back left-to-right as well? Because LEM or BEM swaps bytes, but only with multibyte writing. When we write only one byte at a time, does each byte keep its correct left-to-right order?
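The stream question can be checked with a small sketch. It assumes an in-memory `std::stringstream` stands in for the file, and `round_trip` and `memory_bytes` are hypothetical helper names introduced only for this illustration. The first helper shows that a stream hands bytes back in the order they were written; the second shows where byte order actually comes from, namely the in-memory layout of a multibyte integer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <sstream>
#include <string>

// Hypothetical helper: write the bytes of `data` to a stream one after
// another, then read the same number of bytes back.
std::string round_trip(const std::string& data) {
    std::stringstream stream(std::ios::in | std::ios::out | std::ios::binary);
    stream.write(data.data(), static_cast<std::streamsize>(data.size()));

    std::string back(data.size(), '\0');
    stream.read(&back[0], static_cast<std::streamsize>(back.size()));
    assert(static_cast<std::size_t>(stream.gcount()) == data.size());
    return back;  // same bytes, same order
}

// Hypothetical helper: copy out the two bytes of a 16-bit value in the order
// this machine keeps them in memory (lowest address first).
std::string memory_bytes(std::uint16_t value) {
    char bytes[sizeof value];
    std::memcpy(bytes, &value, sizeof value);
    return std::string(bytes, sizeof value);
}
```

Each byte does have its own address, and the stream never reorders anything. On a little-endian machine `memory_bytes(0x0424)` yields 24 04 and on a big-endian machine 04 24, so the "swap" only shows up when a multibyte integer such as a UTF-16 code unit is dumped from memory. A UTF-8 string is already a sequence of single bytes, so there is nothing to swap.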
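- Yes, `std::string` (or rather, `std::basic_string<char>`) uses `char` to store its data. It is encoding-agnostic, so if you call `size()`, you get the actual number of `char`s that represent the string, not the number of characters or code points.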
- No, the encoding of string literals is implementation-defined. Since C++11 you can use the `u8` prefix to get UTF-8 string literals (e.g. `u8"ANФГ"`).
- If you use UTF-8 string literals consistently, the `std::string` will contain UTF-8, and UTF-8 is what gets written to the file if you use, for example, `operator<<()`.
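- C++ does not keep track of whatever character encoding your file happens to be in (and therefore not of its byte order either). If you happen to use UTF-8 end to end, byte order does not matter, because UTF-8 is endianness-independent.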
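The points above can be sketched as follows. To avoid depending on the compiler's literal encoding, the test string is assumed to be spelled out as explicit UTF-8 bytes (`"AN\xD0\xA4\xD0\x93"` for "ANФГ"). Note also that in C++20 a `u8` literal has type `const char8_t[]` and no longer initializes a `std::string` directly. `count_code_points` and `bytes_written` are hypothetical helper names, and the counter assumes its input is valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: count code points in a string assumed to hold valid
// UTF-8. Every byte that is not a continuation byte (10xxxxxx) starts one.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Hypothetical helper: send the string through operator<< and return the
// bytes the stream received. A std::ofstream opened with std::ios::binary
// would receive the same bytes in the same order.
std::string bytes_written(const std::string& text) {
    std::ostringstream out;
    out << text;
    assert(out.good());
    return out.str();
}
```

With the six-byte string above, `size()` returns 6 while `count_code_points` returns 4, and `bytes_written` returns exactly the bytes that went in: 41 4E D0 A4 D0 93, on any machine.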