jsoup and character encoding

Question

I have several questions about jsoup's charset support, most of them backed by quotes from the API documentation:

jsoup.Jsoup:

public static Document parse(File in, String charsetName) ...
Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 ...

Does this mean the 'charset' meta tag is not used to detect the encoding?

jsoup.nodes.Document:

public void charset(Charset charset)
... This method is equivalent to OutputSettings.charset(Charset) but in addition ...

public Charset charset()
... This method is equivalent to Document.OutputSettings.charset().

Does this mean there are no separate "input charset" and "output charset" settings, and that they are really one and the same setting?

jsoup.nodes.Document:

public void charset(Charset charset) ... Obsolete charset / encoding definitions are removed!

Does this remove the 'http-equiv' meta tag in favor of the 'charset' meta tag? Is there a way to keep both for backward compatibility?

jsoup.nodes.Document.OutputSettings:

public Charset charset() Where possible (when parsing from a URL or File), the document's output charset is automatically set to the input charset. Otherwise, it defaults to UTF-8.

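I need to know whether a document does not specify an encoding*. Does this mean jsoup cannot provide that information?

* Rather than defaulting to UTF-8, I will run juniversalchardet.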

Answer 1

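The documentation is outdated/incomplete. Jsoup does use the charset meta tag, as well as the http-equiv tag, to detect the charset. In the source, the method looks like this: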

public static Document parse(File in, String charsetName) throws IOException {
    return DataUtil.load(in, charsetName, in.getAbsolutePath());
}

DataUtil.load in turn calls parseByteData(...), which looks like this: (Source, scroll down)

// reads bytes first into a buffer, then decodes with the appropriate charset. done this way to support
// switching the chartset midstream when a meta http-equiv tag defines the charset.
// todo - this is getting gnarly. needs a rewrite.
static Document parseByteData(ByteBuffer byteData, String charsetName, String baseUri, Parser parser) {
    String docData;
    Document doc = null;

    if (charsetName == null) { // determine from meta. safe parse as UTF-8
        // look for <meta http-equiv="Content-Type" content="text/html;charset=gb2312"> or HTML5 <meta charset="gb2312">
        docData = Charset.forName(defaultCharset).decode(byteData).toString();
        doc = parser.parseInput(docData, baseUri);
        Element meta = doc.select("meta[http-equiv=content-type], meta[charset]").first();
        if (meta != null) { // if not found, will keep utf-8 as best attempt
            String foundCharset = null;
            if (meta.hasAttr("http-equiv")) {
                foundCharset = getCharsetFromContentType(meta.attr("content"));
            }
            if (foundCharset == null && meta.hasAttr("charset")) {
                try {
                    if (Charset.isSupported(meta.attr("charset"))) {
                        foundCharset = meta.attr("charset");
                    }
                } catch (IllegalCharsetNameException e) {
                    foundCharset = null;
                }
            }

            (Snip...)

The following line from the snippet above shows that it does use meta[http-equiv=content-type] or meta[charset] to detect the encoding, and otherwise falls back to UTF-8:

Element meta = doc.select("meta[http-equiv=content-type], meta[charset]").first();
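
The two-pass idea in parseByteData (decode once as UTF-8, look for a declared charset, then decode again if one is found) can be sketched without jsoup. This is only an illustration: MetaCharsetSniffer, declaredCharset and decode are hypothetical names, and the regex is a rough stand-in for jsoup's real parser and selector. An empty result corresponds to a document that declares no encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup: approximates the meta-tag lookup with a regex.
class MetaCharsetSniffer {
    // Matches charset=... inside a <meta> tag, covering both <meta charset="x"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=x">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\b[^>]*?\\bcharset\\s*=\\s*[\"']?([^\\s\"'/>;]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared in a meta tag, or empty if none is declared or it is unsupported. */
    static Optional<Charset> declaredCharset(byte[] bytes) {
        // First pass: decode as UTF-8. Meta tags are ASCII, so they survive even if the body is mangled.
        String firstPass = new String(bytes, StandardCharsets.UTF_8);
        Matcher m = META_CHARSET.matcher(firstPass);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            String name = m.group(1);
            return Charset.isSupported(name) ? Optional.of(Charset.forName(name)) : Optional.empty();
        } catch (IllegalArgumentException e) { // includes IllegalCharsetNameException
            return Optional.empty();
        }
    }

    /** Second pass: decode with the declared charset, falling back to UTF-8 when nothing is declared. */
    static String decode(byte[] bytes) {
        Charset charset = declaredCharset(bytes).orElse(StandardCharsets.UTF_8);
        return new String(bytes, charset);
    }
}
```

A caller that wants to run a detector such as juniversalchardet instead of defaulting to UTF-8 could branch on declaredCharset(bytes).isPresent() before decoding.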

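I'm not quite sure what you mean here, but no: the output charset setting controls which characters are escaped when the document's HTML/XML is printed, whereas the input charset determines how the file is read.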
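
To make the role of an output charset concrete, here is a minimal sketch of charset-dependent escaping using only the JDK. OutputEscaper and escapeFor are hypothetical names, not jsoup API, and the sketch is simplified: it only replaces characters the target charset cannot encode with numeric character references, and does not handle named entities or markup characters such as < and &.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper, not part of jsoup: shows why escaping depends on the output charset.
class OutputEscaper {
    /** Replaces every code point the output charset cannot encode with a numeric character reference. */
    static String escapeFor(String text, Charset outputCharset) {
        CharsetEncoder encoder = outputCharset.newEncoder();
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String s = new String(Character.toChars(cp));
            if (encoder.canEncode(s)) {
                out.append(s); // representable in the output charset: write as-is
            } else {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
        });
        return out.toString();
    }
}
```

With US-ASCII as the output charset, an é in the text has to be written as a character reference; with UTF-8 it passes through unchanged. The bytes that were read from the input are not affected either way.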

It only removes meta[name=charset] elements. From the source, here is the method that updates/removes charset definitions in the document: (Source, again scroll down)

private void ensureMetaCharsetElement() {
    if (updateMetaCharset) {
        OutputSettings.Syntax syntax = outputSettings().syntax();

        if (syntax == OutputSettings.Syntax.html) {
            Element metaCharset = select("meta[charset]").first();

            if (metaCharset != null) {
                metaCharset.attr("charset", charset().displayName());
            } else {
                Element head = head();

                if (head != null) {
                    head.appendElement("meta").attr("charset", charset().displayName());
                }
            }

            // Remove obsolete elements
            select("meta[name=charset]").remove();
        } else if (syntax == OutputSettings.Syntax.xml) {
        (Snip..)

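In essence, if you call charset(...) and the document has no charset meta tag, one is added; otherwise the existing one is updated. It does not touch the http-equiv tag.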

To find out whether a document specifies an encoding, just look for an http-equiv charset or meta charset tag; if there is no such tag, the document does not specify an encoding.

Jsoup is open source, so you can read the source yourself to see exactly how it works: https://github.com/jhy/jsoup/ (you can also modify it to do what you want!)

I'll update this answer with more detail when I have time. Let me know if you have any other questions.
