Problem with fseek and Microsoft's CRT UNICODE support

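I am trying to read a UTF-8 encoded text file using a Unicode stream. This works fine, but there seems to be a bug with fseek, which the following simple file and program demonstrate:

The text file I'm reading: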

ABC

Raw content of the file:

EF BB BF 41 42 43 0D 0A 

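As you can see, the file contains the UTF-8 BOM and the characters ABC, followed by a line ending.

The program opens the file with UNICODE support, then reads a line and displays the raw content of the buffer, which is as expected. It then fseeks to the beginning and reads the line again, but this time the content of the buffer is different: there are two extra bytes at the start of the buffer, which are actually the BOM of a UTF-16 little endian encoded file.

Program: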

#define _CRT_SECURE_NO_WARNINGS

#include <stdio.h>
#include <stdlib.h>

int main()
{
  FILE *input = _wfopen(L"utf8filewithbom.txt", L"r, ccs=UTF-8");
  if (input == NULL)
  {
    printf("Can't open file\n");
    return 1;
  }

  unsigned char buffer[100];
  fgetws((wchar_t*)buffer, _countof(buffer) / 2, input);
  printf("First 4 bytes of buffer: %02x %02x %02x %02x\n", buffer[0], buffer[1], buffer[2], buffer[3]);

  fseek(input, 0, SEEK_SET);

  fgetws((wchar_t*)buffer, _countof(buffer) / 2, input);
  printf("First 4 bytes of buffer: %02x %02x %02x %02x\n", buffer[0], buffer[1], buffer[2], buffer[3]);

  fclose(input);
}

Expected output:

First 4 bytes of buffer: 41 00 42 00
First 4 bytes of buffer: 41 00 42 00

Actual output:

First 4 bytes of buffer: 41 00 42 00
First 4 bytes of buffer: ff fe 41 00

Is this a bug in the Microsoft CRT, or am I doing something wrong?

I'm using Visual Studio 2019 16.4.3.

Things I tried that didn't change anything:

According to the Microsoft docs on fseek:

When the CRT opens a file that begins with a Byte Order Mark (BOM), the file pointer is positioned after the BOM (that is, at the start of the file's actual content). If you have to fseek to the beginning of the file, use ftell to get the initial position and fseek to it rather than to position 0.

Basically, just adjust your code as follows (comments mark the changed/added lines):

  FILE *input = _wfopen(L"utf8filewithbom.txt", L"r, ccs=UTF-8");
  if (input == NULL)
  {
    printf("Can't open file\n");
    return 1;
  }
  const long postbomoffset = ftell(input); // Store post-BOM offset

  unsigned char buffer[100];
  fgetws((wchar_t*)buffer, _countof(buffer) / 2, input);
  printf("First 4 bytes of buffer: %02x %02x %02x %02x\n", buffer[0], buffer[1], buffer[2], buffer[3]);

  fseek(input, postbomoffset, SEEK_SET);  // Seek to post-BOM offset, not raw beginning

  fgetws((wchar_t*)buffer, _countof(buffer) / 2, input);
  printf("First 4 bytes of buffer: %02x %02x %02x %02x\n", buffer[0], buffer[1], buffer[2], buffer[3]);

  fclose(input);
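
If it helps, here is a minimal sketch of the same "remember where the content starts" idea in portable C, for cases where the CRT does not skip the BOM for you (for example, a stream opened in plain binary mode without ccs=). The helper names skip_utf8_bom and rewind_to_content are hypothetical and not part of any library. The sketch assumes the file was opened with "rb" (or another seekable binary mode) and only looks for the UTF-8 BOM:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: if the stream starts with the UTF-8 BOM (EF BB BF),
   leave the file position just past it; otherwise go back to offset 0.
   Returns the offset where the real content starts, or -1L on error.
   Assumes a seekable stream opened in binary mode. */
long skip_utf8_bom(FILE *fp)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };
    unsigned char head[3];
    size_t n;

    if (fseek(fp, 0L, SEEK_SET) != 0)
        return -1L;

    n = fread(head, 1, sizeof head, fp);
    if (n == sizeof head && memcmp(head, bom, sizeof bom) == 0)
        return ftell(fp);               /* now positioned right after the BOM */

    /* No BOM (or file shorter than 3 bytes): content starts at offset 0. */
    if (fseek(fp, 0L, SEEK_SET) != 0)
        return -1L;
    return 0L;
}

/* Hypothetical helper: seek back to the saved content offset, not to raw 0. */
int rewind_to_content(FILE *fp, long content_start)
{
    return fseek(fp, content_start, SEEK_SET);
}
```

To use it, call skip_utf8_bom once right after opening the file, keep the returned offset, and call rewind_to_content wherever the original code used fseek(input, 0, SEEK_SET).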