Characters get converted into special characters

Question

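I am using Apache POI to read a .docx file and, after some processing, write the result to a .csv file. The .docx file I am working with is in French, but when I write the data to the .csv, some French characters are converted into special characters. For example, Être un membre clé becomes ÃŠtre un membre clÃ©

The following code is used to write the file: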

        Path path = Paths.get(filePath);
        BufferedWriter bw = Files.newBufferedWriter(path);
        CSVWriter writer = new CSVWriter(bw);
        writer.writeAll(data);

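It uses UTF-8 by default.

While debugging, I checked that the data is still intact just before it is written to the .csv. So is it being converted during writing? I have set the default locale to Locale.FRENCH.

Am I missing something?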

Answer 1

"Être un membre clé" (UTF-8) = "ÃŠtre un membre clÃ©" (ANSI)

Check which character encoding is used when the final file is read.
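The equivalence above can be reproduced in a few lines. This is only a sketch: it assumes that "ANSI" means windows-1252, the usual ANSI code page on Western European Windows systems, and the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Encode the text as UTF-8, then decode those same bytes as windows-1252.
    // This mimics a reader that assumes "ANSI" for a file that is really UTF-8.
    static String readUtf8AsAnsi(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source independent of its own file encoding:
        // \u00CA is E with circumflex, \u00E9 is e with acute.
        String original = "\u00CAtre un membre cl\u00E9";
        String garbled = readUtf8AsAnsi(original);

        // Each accented letter is two bytes in UTF-8, and each of those bytes
        // is decoded as a separate character, so the garbled text is longer.
        System.out.println(original.length() + " -> " + garbled.length());
        System.out.println(garbled);
    }
}
```

The written bytes are the same in both cases; only the charset chosen by the reader differs, which is why the data looks fine in the debugger but wrong in the viewer.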

Answer 2

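I suspect that Excel is reading the UTF-8 encoded CSV as ANSI. This happens if you simply open the CSV in Excel without using the Text Import Wizard. If the file has no BOM at the beginning, Excel always expects ANSI. If you open the CSV in a text editor that supports Unicode, everything is displayed correctly.

Example: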
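To check whether a particular CSV actually starts with a BOM, one option is to look at its first three bytes. A minimal sketch, assuming the file is UTF-8 (whose BOM is the byte sequence EF BB BF); the class name, method name and default file name are illustrative only:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class BomCheck {

    // Returns true if the file begins with the UTF-8 byte order mark EF BB BF.
    // read() returns -1 at end of file, so shorter files simply yield false.
    static boolean startsWithUtf8Bom(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return in.read() == 0xEF && in.read() == 0xBB && in.read() == 0xBF;
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the CSV path as the first argument; "test.csv" is just a default.
        Path file = Paths.get(args.length > 0 ? args[0] : "test.csv");
        System.out.println(file + " starts with a UTF-8 BOM: " + startsWithUtf8Bom(file));
    }
}
```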

import java.io.BufferedWriter;

import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.Files;

import java.util.Locale;
import java.util.List;
import java.util.ArrayList;

import com.opencsv.CSVWriter;

class DocxToCSV {

 public static void main(String[] args) throws Exception {

  Locale.setDefault(Locale.FRENCH);

  List<String[]> data = new ArrayList<String[]>();
  data.add(new String[]{"F1", "F2", "F3", "F4"});
  data.add(new String[]{"Être un membre clé", "Être clé", "membre clé"});
  data.add(new String[]{"Être", "un", "membre", "clé"});

  Path path = Paths.get("test.csv");
  BufferedWriter bw = Files.newBufferedWriter(path);

  //bw.write(0xFEFF); bw.flush(); // write a BOM to the file

  CSVWriter writer = new CSVWriter(bw, ';', '"', '"', "\r\n");
  writer.writeAll(data);
  writer.flush();
  writer.close();

 }
}

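If you now open test.csv in a text editor that supports Unicode, everything is correct. But if you open the same file in Excel, the accented characters are displayed incorrectly, as in the question.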

Now we do the same, but with the line

bw.write(0xFEFF); bw.flush(); // write a BOM to the file

active (uncommented).

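With the BOM in place, Excel displays the characters correctly even when test.csv is simply opened in Excel.

Of course, the better way is always to use Excel's Text Import Wizard.

For the same problem, see also.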
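For reference, here is a sketch of the BOM approach that runs without opencsv and checks the resulting bytes. It passes the charset explicitly and uses a temporary file. The class and method names are hypothetical, and the naive semicolon join does no quoting or escaping, so it is no substitute for CSVWriter.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class BomCsvWriter {

    // Writes the rows as semicolon-separated lines, preceded by a BOM.
    // Note: no quoting or escaping is done here; this is illustration only.
    static void writeWithBom(Path file, List<String[]> rows) throws IOException {
        try (BufferedWriter bw = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            bw.write('\uFEFF'); // the BOM character, encoded by the UTF-8 writer
            for (String[] row : rows) {
                bw.write(String.join(";", row));
                bw.write("\r\n");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("bom-demo", ".csv");
        writeWithBom(file, List.of(
                new String[]{"F1", "F2"},
                new String[]{"\u00CAtre", "cl\u00E9"}));

        // Print the first three bytes of the file in hex.
        byte[] bytes = Files.readAllBytes(file);
        System.out.printf("%02X %02X %02X%n", bytes[0], bytes[1], bytes[2]);
    }
}
```

Because the writer encodes U+FEFF as UTF-8, the file starts with the bytes EF BB BF, which is the signature Excel looks for when it is deciding how to decode the file.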

