Implement a function to check whether a string/byte array follows the UTF-8 format

Question

I am trying to solve this interview question.

After being given a clear definition of the UTF-8 format (e.g. 1 byte: 0b0xxxxxxx, 2 bytes: ....), you are asked to write a function that validates whether the input is valid UTF-8. The input will be a string/byte array; the output should be yes/no.

I have two possible approaches.

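First, if the input is a string: since UTF-8 uses at most 4 bytes, after removing the first two characters "0b" we can use Integer.parseInt(s) to check whether the rest of the string is in the range 0 to 10FFFF. It is also better to check first that the length of the string is a multiple of 8 and that the input contains only 0s and 1s. So I would have to traverse the string twice, and the complexity would be O(n).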

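Second, if the input is a byte array (this method also works if the input is a string), we check whether each 1-byte element is in the correct range. If the input is a string, first check that its length is a multiple of 8, then check whether each 8-character substring is in range.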
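For the string case, the length check, the 0/1 check and the conversion to bytes can be done in a single pass. The following is only a sketch under my reading of the input format (an optional "0b" prefix followed by binary digits); the class and method names are made up for illustration.

```java
import java.util.Arrays;

final class BitStringParser {
    /**
     * Converts a string such as "0b1100001010100010" into bytes.
     * Returns null when the text is not a whole number of 8-bit groups
     * or contains characters other than '0' and '1'.
     * (Assumed input format; the interview question does not spell it out.)
     */
    static byte[] parseBits(String text) {
        String bits = text.startsWith("0b") ? text.substring(2) : text;
        if (bits.isEmpty() || bits.length() % 8 != 0) {
            return null;
        }
        byte[] out = new byte[bits.length() / 8];
        for (int i = 0; i < bits.length(); i++) {
            char c = bits.charAt(i);
            if (c != '0' && c != '1') {
                return null;
            }
            // Shift the next bit into the byte that owns position i.
            out[i / 8] = (byte) ((out[i / 8] << 1) | (c - '0'));
        }
        return out;
    }

    public static void main(String[] args) {
        // 0xC2 0xA2 is the two-byte UTF-8 encoding of U+00A2.
        System.out.println(Arrays.toString(parseBits("0b1100001010100010")));
    }
}
```

The resulting byte array can then be handed to any byte-level validator, so the string and byte-array variants of the question reduce to the same problem.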

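I know there are a couple of solutions for checking a string using Java libraries, but my question is how I should implement the function based on the question.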

Thanks a lot.

Answer 1

Let's first have a look at a visual representation of the UTF-8 design.

Now let's go through what we have to do.

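Loop over all characters of the string (each character being one byte).
We need to apply a mask to each byte depending on the code point, since the x characters represent the actual code point. We will use the binary AND operator (&), which copies a bit to the result only if it is set in both operands.
The goal of applying a mask is to remove the trailing bits so that only the leading prefix bits of the byte are compared with the expected first-byte pattern. The mask has its top bits set and all other bits 0: a single 1 bit for a one-byte sequence, and one more 1 bit than the "Bytes in sequence" value for multi-byte sequences (for example 0b11100000 for two bytes).
We can then compare the result with the first-byte pattern to verify that it is valid and to determine what kind of byte it is.
If the byte matches none of the cases, it is invalid and we return "No".
If we get out of the loop, every character is valid, so the string is valid.
Make sure the comparison that returned true corresponds to the expected length.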
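To make the masking step concrete, here is a small sketch that classifies a single lead byte. The class and method names are hypothetical, and it only covers sequence lengths 1 to 4.

```java
final class LeadByteMask {
    /**
     * Returns the sequence length announced by a lead byte,
     * or -1 if the byte cannot start a sequence.
     */
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF; // Java bytes are signed; work with 0..255
        if ((b & 0b10000000) == 0b00000000) return 1; // 0xxxxxxx
        if ((b & 0b11100000) == 0b11000000) return 2; // 110xxxxx
        if ((b & 0b11110000) == 0b11100000) return 3; // 1110xxxx
        if ((b & 0b11111000) == 0b11110000) return 4; // 11110xxx
        return -1; // 10xxxxxx (continuation byte) or 11111xxx
    }

    public static void main(String[] args) {
        // 0xE2 = 0b11100010 is the lead byte of a three-byte sequence.
        System.out.println(sequenceLength((byte) 0xE2));
    }
}
```

For 0xE2 the mask 0b11110000 keeps 0b11100000, which matches the three-byte pattern, so the two following bytes must be continuation bytes of the form 10xxxxxx.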

The method would look like this:

public static final boolean isUTF8(final byte[] pText) {

    int expectedLength = 0;

    for (int i = 0; i < pText.length; i++) {
        if ((pText[i] & 0b10000000) == 0b00000000) {
            expectedLength = 1;
        } else if ((pText[i] & 0b11100000) == 0b11000000) {
            expectedLength = 2;
        } else if ((pText[i] & 0b11110000) == 0b11100000) {
            expectedLength = 3;
        } else if ((pText[i] & 0b11111000) == 0b11110000) {
            expectedLength = 4;
        } else if ((pText[i] & 0b11111100) == 0b11111000) {
            expectedLength = 5;
        } else if ((pText[i] & 0b11111110) == 0b11111100) {
            expectedLength = 6;
        } else {
            return false;
        }

        while (--expectedLength > 0) {
            if (++i >= pText.length) {
                return false;
            }
            if ((pText[i] & 0b11000000) != 0b10000000) {
                return false;
            }
        }
    }

    return true;
}

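EDIT: The actual method is not the original one (almost, but not quite); it was taken from here. The original version did not work properly, as pointed out in @EJP's comment.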

Answer 2

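OK, thanks a lot for the comments and the answer. First of all, I have to agree that this is "another stupid interview question". It is true that in Java a String is already encoded, so it will always be compatible with UTF-8. One way to check it, given a string: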

public static boolean isUTF8(String s){
    try{
        byte[]bytes = s.getBytes("UTF-8");
    }catch(UnsupportedEncodingException e){
        e.printStackTrace();
        System.exit(-1);
    }
    return true;
}

However, since all printable strings are in Unicode form, I had no chance of getting an error.

Second, if a byte array is given, each element will always be in the range -2^7 (0b10000000) to 2^7 - 1 (0b01111111), so it will always be within the valid UTF-8 range.

My initial understanding of the question was that, given a string such as "0b11111111", we check whether it is valid UTF-8. I guess I was wrong.

Moreover, Java does provide a constructor that converts a byte array to a string; if you are interested in the decode method, check here.

One more thing: the answer above would be correct in another language. The only improvement could be:

In November 2003, UTF-8 was restricted by RFC 3629 to end at U+10FFFF, in order to match the constraints of the UTF-16 character encoding. This removed all 5- and 6-byte sequences, and about half of the 4-byte sequences.

So 4 bytes are enough.
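As a rough sketch of what a check restricted this way might look like, the following validator stops at four bytes and also rejects overlong forms, surrogate code points and values above U+10FFFF. This is my own illustration, not code from the answers above, and the names are hypothetical.

```java
final class StrictUtf8 {
    /** Returns true if the bytes are well-formed UTF-8 with at most 4 bytes per code point. */
    static boolean isValid(byte[] in) {
        int i = 0;
        while (i < in.length) {
            int b = in[i] & 0xFF; // unsigned value of the lead byte
            int extra;            // number of continuation bytes expected
            int cp;               // code point being assembled
            int min;              // smallest code point allowed for this length
            if (b < 0x80) {
                i++;              // ASCII
                continue;
            } else if ((b & 0xE0) == 0xC0) {
                extra = 1; cp = b & 0x1F; min = 0x80;
            } else if ((b & 0xF0) == 0xE0) {
                extra = 2; cp = b & 0x0F; min = 0x800;
            } else if ((b & 0xF8) == 0xF0) {
                extra = 3; cp = b & 0x07; min = 0x10000;
            } else {
                return false;     // stray continuation byte or 5/6-byte lead
            }
            if (i + extra >= in.length) {
                return false;     // sequence is truncated
            }
            for (int k = 1; k <= extra; k++) {
                int c = in[i + k] & 0xFF;
                if ((c & 0xC0) != 0x80) {
                    return false; // not a 10xxxxxx continuation byte
                }
                cp = (cp << 6) | (c & 0x3F);
            }
            if (cp < min) return false;                     // overlong encoding
            if (cp > 0x10FFFF) return false;                // beyond the Unicode range
            if (cp >= 0xD800 && cp <= 0xDFFF) return false; // UTF-16 surrogate
            i += extra + 1;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] fiveByteLead = {(byte) 0xF8, (byte) 0x88, (byte) 0x80, (byte) 0x80, (byte) 0x80};
        System.out.println(isValid(fiveByteLead));
    }
}
```

Decoding the code point while scanning is what makes the extra range checks possible; a purely structural check of lead and continuation bytes cannot tell 0xC0 0x80 apart from a legitimate two-byte sequence.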

I am fairly sure about this; please correct me if I am wrong. Thanks a lot.

Answer 3

CharsetDecoder may be what you are looking for:

@Test
public void testUTF8() throws CharacterCodingException {
    // the desired charset
    final Charset UTF8 = Charset.forName("UTF-8");
    // prepare decoder
    final CharsetDecoder decoder = UTF8.newDecoder();
    decoder.onMalformedInput(CodingErrorAction.REPORT);
    decoder.onUnmappableCharacter(CodingErrorAction.REPORT);

    byte[] bytes = new byte[48];
    new Random().nextBytes(bytes);
    ByteBuffer buffer = ByteBuffer.wrap(bytes);
    try {
        decoder.decode(buffer);
        fail("Should not be UTF-8");
    } catch (final CharacterCodingException e) {
        // noop, the test should fail here
    }

    final String string = "hallo welt!";
    bytes = string.getBytes(UTF8);
    buffer = ByteBuffer.wrap(bytes);
    final String result = decoder.decode(buffer).toString();
    assertEquals(string, result);
}

So your function might look like this:

public static boolean checkEncoding(final byte[] bytes, final String encoding) {
    final CharsetDecoder decoder = Charset.forName(encoding).newDecoder();
    decoder.onMalformedInput(CodingErrorAction.REPORT);
    decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
    final ByteBuffer buffer = ByteBuffer.wrap(bytes);

    try {
        decoder.decode(buffer);
        return true;
    } catch (final CharacterCodingException e) {
        return false;
    }
}
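One reason to lean on the decoder is that it checks more than the bit patterns of lead and continuation bytes. The sketch below, with a hypothetical helper name, feeds it an overlong encoding and an encoded surrogate, both of which the JDK's UTF-8 decoder reports as malformed as far as I know.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class DecoderStrictness {
    /** Returns true if the JDK UTF-8 decoder accepts the bytes without reporting an error. */
    static boolean decodesAsUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] overlong = {(byte) 0xC0, (byte) 0x80};               // overlong form of U+0000
        byte[] surrogate = {(byte) 0xED, (byte) 0xA0, (byte) 0x80}; // encoded U+D800
        byte[] euro = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC};      // U+20AC
        System.out.println(decodesAsUtf8(overlong));
        System.out.println(decodesAsUtf8(surrogate));
        System.out.println(decodesAsUtf8(euro));
    }
}
```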

Answer 4

public static boolean validUTF8(byte[] input) {
    int i = 0;
    // Check for BOM
    if (input.length >= 3 && (input[0] & 0xFF) == 0xEF
            && (input[1] & 0xFF) == 0xBB && (input[2] & 0xFF) == 0xBF) {
        i = 3;
    }

    int end;
    for (int j = input.length; i < j; ++i) {
        int octet = input[i];
        if ((octet & 0x80) == 0) {
            continue; // ASCII
        }

        // Check for UTF-8 leading byte
        if ((octet & 0xE0) == 0xC0) {
            end = i + 1;
        } else if ((octet & 0xF0) == 0xE0) {
            end = i + 2;
        } else if ((octet & 0xF8) == 0xF0) {
            end = i + 3;
        } else {
            // Stray trailing byte, or a lead byte for a sequence longer than 4 bytes
            return false;
        }

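        if (end >= input.length) {
            // Truncated sequence: the trailing bytes run past the end of the input
            return false;
        }
        while (i < end) {
            i++;
            octet = input[i];
            if ((octet & 0xC0) != 0x80) {
                // Not a valid trailing byte
                return false;
            }
        }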
    }
    return true;
}

Answer 5

A small solution for a real-world UTF-8 compatibility check:

public static final boolean isUTF8(final byte[] inputBytes) {
    final String converted = new String(inputBytes, StandardCharsets.UTF_8);
    final byte[] outputBytes = converted.getBytes(StandardCharsets.UTF_8);
    return Arrays.equals(inputBytes, outputBytes);
}
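The round trip works because the String constructor does not throw on malformed input; it substitutes the replacement character U+FFFD, which then re-encodes to different bytes. A small sketch with hypothetical names that makes the substitution visible:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class RoundTripDemo {
    /** Decodes leniently: malformed input becomes U+FFFD instead of raising an error. */
    static String decodeLossy(byte[] input) {
        return new String(input, StandardCharsets.UTF_8);
    }

    /** Same idea as the round-trip check: valid input survives decode + encode unchanged. */
    static boolean roundTrips(byte[] input) {
        byte[] reencoded = decodeLossy(input).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(input, reencoded);
    }

    public static void main(String[] args) {
        byte[] stray = {(byte) 0x80}; // a continuation byte with no lead byte
        String decoded = decodeLossy(stray);
        // The stray byte was replaced by U+FFFD, which encodes as EF BF BD.
        System.out.println(Integer.toHexString(decoded.charAt(0)));
        System.out.println(roundTrips(stray));
    }
}
```

The trade-off is that this allocates a String and a second byte array for every check, which may matter for large inputs.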

You can check the test results:

@Test
public void testEnconding() {

    byte[] invalidUTF8Bytes1 = new byte[]{(byte)0b10001111, (byte)0b10111111 };
    byte[] invalidUTF8Bytes2 = new byte[]{(byte)0b10101010, (byte)0b00111111 };
    byte[] validUTF8Bytes1 = new byte[]{(byte)0b11001111, (byte)0b10111111 };
    byte[] validUTF8Bytes2 = new byte[]{(byte)0b11101111, (byte)0b10101010, (byte)0b10111111 };

    assertThat(isUTF8(invalidUTF8Bytes1)).isFalse();
    assertThat(isUTF8(invalidUTF8Bytes2)).isFalse();
    assertThat(isUTF8(validUTF8Bytes1)).isTrue();
    assertThat(isUTF8(validUTF8Bytes2)).isTrue();
    assertThat(isUTF8("\u24b6".getBytes(StandardCharsets.UTF_8))).isTrue();
}

The test cases are copied from https://codereview.stackexchange.com/questions/59428/validating-utf-8-byte-array

java

string

utf-8