DocumentBuilder.parse seems to randomly skip the beginning of an XML InputStream returned from an HTTP request

Question

I have the following code to send an HTTP request, receive the response (as XML), and parse it:

public Document getDocumentElementFromDatabase() {
    // this URL is actually built dynamically from a query, but for this example I just use one of the possible resulting URLs
    String url = "http://musicbrainz.org/ws/2/recording?query=%22Thunderstruck%22+AND+artistname%3A%222Cellos%22";

    // declared outside the try block so that the finally block can see it
    HttpURLConnection connection = null;

    try {
        // sleep between successive requests to avoid flooding the server
        Thread.sleep(1000);
        connection = runQuery(url);
        InputStream stream = connection.getInputStream();
        if (stream != null) {
            BufferedInputStream buff = new BufferedInputStream(stream);
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(buff);
        }
    }

    // I've grouped exception handling for this example
    catch (ParserConfigurationException | InterruptedException | SAXException | IOException e) {
        e.printStackTrace();
    }

    finally {
        if (connection != null) connection.disconnect();
    }

    return null;
}

private HttpURLConnection runQuery(String url) throws MalformedURLException, IOException {
    HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
    connection.setRequestProperty("User-Agent", "MyAppName/1.0 ( myemail@email.email )");
    return connection;
}

This code is called many times, and sometimes I get the following error:

[Fatal Error] :1:1: Content is not allowed in prolog.
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
    ...

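If I open the URL in Chrome, I get a valid XML response every time, no matter how many times I reload. What's more, the problem does not seem to occur when I run exactly the same code on my laptop.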

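After some tinkering, I tried printing the InputStreams directly as strings (using method 4 from this link) instead of parsing them. I noticed that the response sometimes does not start with the expected XML header (<?xml version="1.0" encoding="UTF-8" standalone="yes"?>), while at other times it does.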
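For reference, dumping the raw response instead of parsing it could look roughly like the sketch below. This is not the method from the link; the names ResponsePeek, readAll and looksLikeXml are made up for illustration, and the code assumes the body is UTF-8. The in-memory stream in main is only a stand-in for connection.getInputStream().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class ResponsePeek {

    /** Reads the whole stream into a String, assuming UTF-8 content. */
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Rough check: after an optional BOM and whitespace, XML must start with '<'. */
    static boolean looksLikeXml(String body) {
        String s = body.startsWith("\uFEFF") ? body.substring(1) : body;
        return s.trim().startsWith("<");
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for connection.getInputStream() (assumption for this sketch).
        String sample = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><metadata/>";
        InputStream stream = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));

        String body = readAll(stream);
        // Print only the beginning; that is where the parser complains (line 1, column 1).
        System.out.println(body.substring(0, Math.min(80, body.length())));
        System.out.println("looks like XML: " + looksLikeXml(body));
    }
}
```

Logging the first few characters whenever looksLikeXml returns false should show what the server actually sent in the failing cases.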

My guess is that I'm doing something wrong with the streams, but I can't figure out what.

Answer 1

I found the problem. The site apparently sometimes returns a JSON response instead of XML, which makes the parser fail. I added the following line to runQuery:

connection.setRequestProperty("Accept", "application/xml");

The code now runs without errors.
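Putting it together, runQuery with the Accept header might look like the sketch below. The isXmlContentType helper is hypothetical and not part of the original code: it is one possible guard to run on connection.getContentType() before handing the stream to the parser, so that a non-XML reply produces a clear message instead of a SAXParseException. The list of media types it accepts is an assumption, as is the server reporting its Content-Type accurately.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Locale;

class XmlQuery {

    /** Opens the connection and explicitly asks the server for XML. */
    static HttpURLConnection runQuery(String url) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
        connection.setRequestProperty("User-Agent", "MyAppName/1.0 ( myemail@email.email )");
        connection.setRequestProperty("Accept", "application/xml");
        return connection;
    }

    /**
     * Hypothetical helper: returns true if a Content-Type header value names an
     * XML media type. Parameters such as "; charset=UTF-8" are ignored.
     */
    static boolean isXmlContentType(String contentType) {
        if (contentType == null) {
            return false;
        }
        String type = contentType.toLowerCase(Locale.ROOT);
        int semicolon = type.indexOf(';');
        if (semicolon >= 0) {
            type = type.substring(0, semicolon);
        }
        type = type.trim();
        return type.equals("application/xml")
                || type.equals("text/xml")
                || type.endsWith("+xml");
    }

    public static void main(String[] args) {
        // No network access here; this only exercises the header check.
        System.out.println(isXmlContentType("application/xml; charset=UTF-8"));
        System.out.println(isXmlContentType("application/json"));
    }
}
```

In getDocumentElementFromDatabase, the check would go right before the parse call, for example by skipping or logging the response when isXmlContentType(connection.getContentType()) is false.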

java

xml

parsing

inputstream