How to convert a codepoint 32bit integer array (UTF-32?) to Windows native string?

Question

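How do I convert an array of 32-bit integer code points (UTF-32?) to a Windows native string? What is the Windows native string type, at the API level, that handles Unicode? Does it correctly handle characters beyond 65535 (U+FFFF)?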

Answer 1

Windows uses UTF-16 as its native string type. UTF-16 handles code points up to U+10FFFF, using surrogate pairs to encode code points above U+FFFF.
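To make the surrogate pair arithmetic concrete, here is a minimal sketch for a single code point. The helper name `EncodeSurrogatePair` is hypothetical (it is not a Windows or standard library function), and the sketch assumes the input is already known to be in the range U+10000..U+10FFFF.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical helper: splits a supplementary code point (U+10000..U+10FFFF)
// into its UTF-16 high and low surrogate code units.
// Assumption: the caller has already checked the range.
std::pair<char16_t, char16_t> EncodeSurrogatePair(std::uint32_t cp)
{
    cp -= 0x10000;  // leaves a 20-bit offset
    char16_t high = static_cast<char16_t>(0xD800 + (cp >> 10));    // top 10 bits
    char16_t low  = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // bottom 10 bits
    return std::make_pair(high, low);
}
```

For example, U+1F600 becomes the two code units 0xD83D and 0xDE00.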

Windows has no concept of UTF-32, so you will have to do one of the following:

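If you are using C++11 or later, it has native std::u16string and std::u32string types, and std::codecvt classes for converting data between UTF-8, UTF-16, and UTF-32. Note that std::wstring_convert and the <codecvt> facets are deprecated as of C++17.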

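#include <codecvt>
#include <locale>
#include <string>

std::u16string Utf32ToUtf16(const std::u32string &codepoints)
{
    // Serialize the code points as little-endian UTF-16 bytes.
    std::wstring_convert<
        std::codecvt_utf16<char32_t, 0x10ffff, std::little_endian>,
        char32_t> conv;
    std::string bytes = conv.to_bytes(codepoints);
    // Reinterpret the byte buffer as UTF-16 code units
    // (assumes a little-endian host, which Windows is).
    return std::u16string(reinterpret_cast<const char16_t*>(bytes.data()), bytes.length() / sizeof(char16_t));
}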

If you are using an earlier C/C++ version, you will have to convert from UTF-32 to UTF-16 manually:

#include <stdint.h>
#include <string>
#include <vector>

// on Windows, wchar_t is 2 bytes, suitable for UTF-16
std::wstring Utf32ToUtf16(const std::vector<uint32_t> &codepoints)
{
    std::wstring result;
    int len = 0;

    // First pass: count the UTF-16 code units needed.
    for (std::vector<uint32_t>::const_iterator iter = codepoints.begin(); iter != codepoints.end(); ++iter)
    {
        uint32_t cp = *iter;
        if (cp < 0x10000) {
            ++len;
        }
        else if (cp <= 0x10FFFF) {
            len += 2; // needs a surrogate pair
        }
        else {
            // invalid code point: reserve one unit for U+FFFD
            ++len;
        }
    }

    if (len > 0)
    {
        result.resize(len);
        len = 0;

        // Second pass: write the code units.
        for (std::vector<uint32_t>::const_iterator iter = codepoints.begin(); iter != codepoints.end(); ++iter)
        {
            uint32_t cp = *iter;
            if (cp < 0x10000) {
                result[len++] = static_cast<wchar_t>(cp);
            }
            else if (cp <= 0x10FFFF) {
                cp -= 0x10000;
                result[len++] = static_cast<wchar_t>((cp >> 10) + 0xD800);   // high surrogate
                result[len++] = static_cast<wchar_t>((cp & 0x3FF) + 0xDC00); // low surrogate
            }
            else {
                // invalid code point: emit U+FFFD REPLACEMENT CHARACTER
                result[len++] = static_cast<wchar_t>(0xFFFD);
            }
        }
    }

    return result;
}
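One caveat about the manual conversion above: values in the surrogate range (U+D800..U+DFFF) are not valid UTF-32, yet they are copied through as single code units, which can produce ill-formed UTF-16. A minimal sketch of a pre-check follows, assuming you would rather replace such values with U+FFFD. Both helper names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: true if cp is a Unicode scalar value, i.e. within
// U+0000..U+10FFFF and outside the surrogate range U+D800..U+DFFF.
bool IsUnicodeScalarValue(std::uint32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Hypothetical helper: maps anything that is not a scalar value to
// U+FFFD REPLACEMENT CHARACTER and leaves valid values unchanged.
std::uint32_t SanitizeCodePoint(std::uint32_t cp)
{
    return IsUnicodeScalarValue(cp) ? cp : 0xFFFD;
}
```

Running each input value through `SanitizeCodePoint` before the conversion keeps the length calculation and the encoding loop unchanged.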

Use a 3rd-party library, such as libiconv or ICU.

c

windows

unicode

winapi

visual-c++