How to properly add hex escapes into a string literal?

When you have a string in C, you can put hex codes directly inside it.

char str[] = "abcde"; // 'a', 'b', 'c', 'd', 'e', 0x00
char str2[] = "abc\x12\x34"; // 'a', 'b', 'c', 0x12, 0x34, 0x00

Both examples occupy 6 bytes in memory. The problem appears when you want to follow a hex escape with a character in [a-fA-F0-9].

//I want: 'a', 'b', 'c', 0x12, 'e', 0x00
//Error, hex is too big because last e is treated as part of hex thus becoming 0x12e
char problem[] = "abc\x12e";

A possible workaround is to replace the byte after the definition.

//This will work, bad idea
char solution[6] = "abcde";
solution[3] = 0x12;

This works, but it fails as soon as you make the array const.

//This will not work
const char solution[6] = "abcde";
solution[3] = 0x12; //Compilation error!

How do I properly insert e after \x12 without triggering the error?


Why am I asking? When you want to build a UTF-8 string as a constant, you have to use the hex values of any character that is larger than the ASCII table can hold.

Use 3 octal digits:

char problem[] = "abc\022e";

Or split your string:

char problem[] = "abc\x12" "e";

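Why these work:

  • Unlike hex escapes, an octal escape is defined by the standard to be at most 3 digits long, so the character after the third digit is never absorbed into it.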

    6.4.4.4 Character constants

    ...

    octal-escape-sequence:
        \ octal-digit
        \ octal-digit octal-digit
        \ octal-digit octal-digit octal-digit
    

    ...

    hexadecimal-escape-sequence:
        \x hexadecimal-digit
        hexadecimal-escape-sequence hexadecimal-digit
    
  • String literal concatenation is defined as a later translation phase than the conversion of escape sequences inside literals.

    5.1.1.2 Translation phases

    ...

    5. Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.

    6. Adjacent string literal tokens are concatenated.
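
To make the two fixes above concrete, here is a minimal sketch that checks both spellings against the byte sequence the question asks for. The names via_octal, via_split and same_bytes are made up for illustration and are not part of the original answer.

```c
#include <assert.h>
#include <string.h>

/* Octal escape: at most three octal digits are consumed (022 == 0x12),
   so the following 'e' stays an ordinary character. */
static const char via_octal[] = "abc\022e";

/* Split literal: "\x12" is converted in its own token first,
   and only afterwards are the adjacent literals concatenated. */
static const char via_split[] = "abc\x12" "e";

/* Returns 1 when both spellings hold 'a','b','c',0x12,'e',0x00. */
int same_bytes(void)
{
    static const char expected[] = { 'a', 'b', 'c', 0x12, 'e', 0x00 };

    assert(sizeof via_octal == sizeof expected);
    assert(sizeof via_split == sizeof expected);

    return memcmp(via_octal, expected, sizeof expected) == 0
        && memcmp(via_split, expected, sizeof expected) == 0;
}
```

Calling same_bytes() from main and checking that it returns 1 is enough to confirm that neither form swallows the trailing e.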

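Since string literals are concatenated early in the compilation process, but only after escape sequences have been converted, you can simply use: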

char problem[] = "abc\x12" "e";

For readability, though, you may prefer to separate the pieces completely:

char problem[] = "abc" "\x12" "e";

For the language lawyers among us, this is covered in C11 5.1.1.2 Translation phases:

  5. Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.

  6. Adjacent string literal tokens are concatenated.
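
Because the bytes are fixed in the initializer itself, the split form also works for the const case from the question. The following is only a sketch, assuming a C11 compiler for static_assert; byte_at is a hypothetical helper added here for inspection.

```c
#include <assert.h>
#include <stddef.h>

/* const is fine: nothing is patched after the definition. */
const char solution[] = "abc" "\x12" "e";

/* Six bytes: 'a', 'b', 'c', 0x12, 'e', 0x00. Checked at compile time (C11). */
static_assert(sizeof solution == 6, "unexpected literal length");

/* Hypothetical helper: reads one byte of the array as an unsigned value. */
unsigned byte_at(size_t i)
{
    return (unsigned char)solution[i];
}
```

With this, byte_at(3) yields 0x12 and byte_at(4) yields 'e', which is the layout the question wanted.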

Why am I asking? When you want to build a UTF-8 string as a constant, you have to use the hex values of any character that is larger than the ASCII table can hold.

Well, no. You don't have to. Since C11 you can prefix a string constant with u8, which tells the compiler that the string literal is encoded as UTF-8.

char solution[] = u8"no need to use hex-codes áé§µ";

(By the way, C++11 supports the same thing.)
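
As a rough check of the u8 approach, the sketch below compares a u8 literal with the hand-written hex bytes for the same text, "é" followed by "e". It uses the universal character name \u00E9 instead of typing é directly, which is an assumption made here so the result does not depend on the source file's encoding; \u always takes exactly four hex digits, so the trailing e is not absorbed. The names with_u8, with_hex and same_utf8 are illustrative only.

```c
#include <assert.h>
#include <string.h>

/* U+00E9 (e with acute accent) followed by a plain 'e', encoded by the compiler. */
const char with_u8[] = u8"\u00E9e";

/* The same text spelled with hex escapes; the split keeps 'e' out of \xA9. */
const char with_hex[] = "\xC3\xA9" "e";

/* Four bytes each: 0xC3, 0xA9, 'e', 0x00. */
static_assert(sizeof with_u8 == 4, "unexpected UTF-8 length");
static_assert(sizeof with_hex == 4, "unexpected hex-escape length");

/* Returns 1 when both arrays hold identical bytes. */
int same_utf8(void)
{
    return memcmp(with_u8, with_hex, sizeof with_hex) == 0;
}
```

If same_utf8() returns 1, the compiler produced the same bytes you would otherwise have had to look up and escape by hand.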