Unable to determine structure as arff when using utf-8 arff file in Weka

Question

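I am having trouble opening an arff file with Weka.

When the arff file's encoding is set to ANSI, everything seems to work fine. But when I set the encoding to utf-8 (which my data requires), I get the following error: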

Unable to determine structure as arff(Reason java.io.Exception: keyword @relation expected,read token[@relation], line 1).

My arff file appears to be correctly formatted.

@relation myrelation

@attribute pagename string
@attribute pagetext string
@attribute pagecategory string
@attribute pageclass {0,1,2,3,4,5,6,7,8,9,10}

@data
.......

Note: I have also changed the file encoding to utf-8 in the RunWeka.ini file.

Answer 1

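Since the error mentions line 1, I suspect the UTF-8 file was written with a BOM (byte order mark, U+FEFF) at the start of the file. Notepad under Windows writes this unwanted zero-width no-break space to distinguish UTF-8 text files from ANSI text files.

Create the file without the BOM. This can be done with a programmer's editor (JEdit, Notepad++) or some hex editors, or you can delete the first line and retype it. Then check the file size: in UTF-8 the BOM takes three bytes (EF BB BF), so the file without it should be exactly three bytes smaller.

Many parsers do not expect such a BOM, do not treat it as whitespace, and fail. That would explain why the token reads as [@relation] but still does not match the keyword: the invisible character is part of it.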
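To confirm the suspicion before changing anything, one option is to look at the first three bytes of the file. The following is a minimal sketch that assumes the file is UTF-8; the class name `BomCheck` and its methods are hypothetical names chosen for this illustration and are not part of Weka.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of Weka.
public class BomCheck {

    /** Returns true if the given bytes begin with the UTF-8 BOM (EF BB BF). */
    public static boolean hasUtf8Bom(byte[] head) {
        return head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
    }

    /** Reads up to the first three bytes of the file and checks them for a BOM. */
    public static boolean fileHasUtf8Bom(Path path) throws IOException {
        byte[] head = new byte[3];
        int n = 0;
        try (InputStream in = Files.newInputStream(path)) {
            // read() may return fewer bytes than requested, so loop until 3 or EOF.
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break;
                }
                n += r;
            }
        }
        return n == head.length && hasUtf8Bom(head);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java BomCheck <file>");
            return;
        }
        Path path = Paths.get(args[0]);
        System.out.println(fileHasUtf8Bom(path) ? "UTF-8 BOM found" : "No UTF-8 BOM");
    }
}
```

Running it with the path of the arff file as the only argument prints whether the file starts with a BOM. The file itself is not modified.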

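import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Inside a method that may throw IOException:
Path path = Paths.get("...");
String s = new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
// Strip a leading BOM (U+FEFF) if present.
String t = s.replaceFirst("^\uFEFF", "");
if (!s.equals(t)) {
    System.out.println("BOM character present in UTF-8 text");
    Files.write(path, t.getBytes(StandardCharsets.UTF_8)); // Replaces file!
}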
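If overwriting the file is not desirable, an alternative is to skip the BOM while reading. This only helps when the file is loaded from your own Java code rather than through the Weka GUI. The sketch below relies on Java's UTF-8 decoder passing the BOM through as the character U+FEFF; the class name `BomSkippingReader` and the method `open` are hypothetical names for this illustration, and passing the returned reader to a parser that accepts a `java.io.Reader` is an assumption about how you load the data.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of Weka.
public class BomSkippingReader {

    /** Opens a UTF-8 text file and positions the reader after a leading BOM, if any. */
    public static BufferedReader open(Path path) throws IOException {
        BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8);
        reader.mark(1);              // remember the start of the file
        int first = reader.read();   // first character, or -1 for an empty file
        if (first != 0xFEFF) {
            reader.reset();          // no BOM: rewind so nothing is lost
        }
        return reader;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java BomSkippingReader <file>");
            return;
        }
        try (BufferedReader reader = open(Paths.get(args[0]))) {
            // With the BOM skipped, the first line should start with the keyword itself.
            System.out.println(reader.readLine());
        }
    }
}
```

The original file stays untouched, which is the main reason to prefer this over rewriting it; the cost is that every program reading the file has to apply the same workaround.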

