BufferedReader can't read long line

Question

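I am reading this file: https://www.reddit.com/r/tech/top.json?limit=100 from an HttpURLConnection into a BufferedReader. I have got it to read some of the file, but it only reads about 1/10 of what it should. Changing the size of the input buffer does not change anything; it just prints the same thing in smaller chunks: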

try{
    URL url = new URL(urlString);

    HttpURLConnection connection = (HttpURLConnection) url.openConnection();

    BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));

    StringBuilder sb = new StringBuilder();

    int charsRead;
    char[] inputBuffer = new char[500];
    while(true) {
        charsRead = reader.read(inputBuffer);
        if(charsRead < 0) {
            break;
        }
        if(charsRead > 0) {
            sb.append(String.copyValueOf(inputBuffer, 0, charsRead));
            Log.d(TAG, "Value read " + String.copyValueOf(inputBuffer, 0, charsRead));
        }
    }

    reader.close();

    return sb.toString();
} catch(Exception e){
   e.printStackTrace();
}

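I believe the issue is that the text is all on one line since it's not formatted in json correctly, and BufferedReader can only take a line so long. Is there any way around this?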

Answer 1

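read() should keep being called for as long as charsRead > 0. Each time read is called, the reader remembers where it last stopped, and the next call starts from that position and continues until there is nothing more to read. There is no limit on how much it can read in total. The only limit is the size of the array passed to a single call; the total size of the file has none.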
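To illustrate the point, here is a minimal sketch that runs the same kind of read loop over a long single-line input held in memory. The class name ReadLoopDemo and the helper longSingleLine are made up for this example, and a StringReader stands in for the network stream. Under those assumptions, the whole input comes back regardless of the array size.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReadLoopDemo {

    // Drains a Reader with the same loop shape as in the question:
    // keep calling read() until it reports end of stream with -1.
    static String readAll(Reader source, int bufferSize) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[bufferSize];
        try (BufferedReader reader = new BufferedReader(source)) {
            int charsRead;
            while ((charsRead = reader.read(buffer)) != -1) {
                // Append only the chars that this call actually filled in.
                sb.append(buffer, 0, charsRead);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Hypothetical stand-in for a minified JSON body: one long line, no newlines.
    static String longSingleLine(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append((char) ('a' + (i % 26)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = longSingleLine(200_000);
        String output = readAll(new StringReader(input), 500);
        System.out.println("all chars read: " + output.equals(input));
    }
}
```

If this loop returns everything locally but the real request does not, the missing data was most likely never delivered by the connection in the first place.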

You can try the following approach:

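// Needs java.io.ByteArrayOutputStream, java.io.IOException, java.io.InputStream
// and java.nio.charset.StandardCharsets
try (InputStream is = connection.getInputStream();
     ByteArrayOutputStream baos = new ByteArrayOutputStream()) {

  int read;
  byte[] buffer = new byte[4096];

  // read() returns -1 at end of stream
  while ((read = is.read(buffer)) != -1) {
    baos.write(buffer, 0, read);
  }

  // Decode once, after all bytes have arrived
  return new String(baos.toByteArray(), StandardCharsets.UTF_8);
} catch (IOException ex) {
  // Do not swallow the failure silently
  ex.printStackTrace();
}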

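The approach above works purely with the bytes from the stream: it copies them into an output stream and then creates the string from the collected bytes.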

Answer 2

I believe the issue is that the text is all on one line since it's not formatted in json correctly, and BufferedReader can only take a line so long.

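This explanation is not correct:

You are not reading a line at a time, and BufferedReader is not treating the text as line-based.
Even when you do read a line at a time from a BufferedReader (i.e. using readLine()), the only limits on the length of a line are the inherent limit on Java String length (2^31 - 1 characters) and the size of your heap.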
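As a rough check of the second point, the sketch below calls readLine() on a line of one million characters through a BufferedReader whose internal buffer is deliberately set to only 16 chars. The class name LongLineDemo and the sizes are arbitrary choices for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;

class LongLineDemo {

    // Returns the length of the first line as seen by readLine(),
    // or -1 if the input is empty.
    static int firstLineLength(String text) {
        // The second constructor argument is the internal buffer size in chars.
        try (BufferedReader reader = new BufferedReader(new StringReader(text), 16)) {
            String line = reader.readLine();
            return line == null ? -1 : line.length();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        char[] chars = new char[1_000_000];
        Arrays.fill(chars, 'x');
        String text = new String(chars) + "\nsecond line";
        System.out.println("first line length: " + firstLineLength(text));
    }
}
```

The internal buffer only affects how many chars are fetched from the underlying reader at a time; it does not cap the length of the line that readLine() returns.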

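(Also, note that "correct" JSON formatting is subjective. The JSON specification says nothing about formatting. JSON emitters commonly do not waste CPU cycles and network bandwidth on formatting JSON that a human will rarely read. Application code that consumes JSON needs to be able to cope with this.)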

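So what is actually going on?

It is unclear, but here are some possibilities:

A StringBuilder also has an inherent limit of 2^31 - 1 characters. However, with (at least) some implementations, if you attempt to grow a StringBuilder beyond that limit, it will throw an OutOfMemoryError. (This behavior does not appear to be documented, but it is clear from reading the source code in Java 8.)
Maybe you are reading the data too slowly (e.g. because your network connection is too slow) and the server is timing out the connection.
Maybe the server has a limit on the amount of data that it is willing to send in a response.

Since you do not mention any exceptions, and you always seem to get the same amount of data, I suspect the third explanation is the correct one.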
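One way to tell these cases apart is to count the bytes that actually arrive and compare them with what the server declared. The sketch below is only an illustration: the class TransferCheck and its methods are hypothetical, and an in-memory stream stands in for the response body. On a real connection the declared length would come from connection.getContentLengthLong(), which returns -1 when the length is not known.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class TransferCheck {

    // Counts every byte the stream delivers before end of stream.
    static long countBytes(InputStream in) {
        long total = 0;
        byte[] buffer = new byte[4096];
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    // declaredLength is the Content-Length value, or -1 when it is unknown.
    static String verdict(long declaredLength, long received) {
        if (declaredLength < 0) {
            return "no Content-Length; cannot compare (" + received + " bytes received)";
        }
        if (received < declaredLength) {
            return "truncated: " + received + " of " + declaredLength + " bytes";
        }
        return "complete: " + received + " bytes";
    }

    public static void main(String[] args) {
        // Stand-in for a response body of 10,000 bytes.
        InputStream body = new ByteArrayInputStream(new byte[10_000]);
        System.out.println(verdict(10_000, countBytes(body)));
    }
}
```

If the count matches the declared length and the data is still shorter than expected, the server simply sent less, which would point to the third explanation.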

Answer 3

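My wild guess is that your default platform charset is UTF-8 and that an encoding problem arose. For remote content the encoding should be specified explicitly, not assumed to be equal to the default encoding on your machine.

The charset of the response data must be correct. For that, the headers must be inspected. The default should be Latin-1 (ISO-8859-1), but browsers interpret that as Windows Latin-1 (Cp-1252).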

        // replaceFirst takes a regex; String.replace would treat the pattern as literal text
        String charset = connection.getContentType().replaceFirst("^.*?(charset=|$)", "");
        if (charset.isEmpty()) {
            charset = "Windows-1252"; // Windows Latin-1
        }
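The header inspection can also be written as a small null-safe helper. This is a sketch under assumptions, not the answer's exact method: the class ContentTypeCharset and the method charsetOf are names invented here, and the caller chooses the fallback (the snippet above uses Windows-1252).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ContentTypeCharset {

    // Extracts the charset parameter from a Content-Type value such as
    // "application/json; charset=UTF-8". Returns the fallback when the header
    // is missing, has no charset parameter, or names an unsupported charset.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String trimmed = part.trim();
            // Case-insensitive check for the "charset=" prefix (8 chars).
            if (trimmed.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = trimmed.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // unknown or malformed charset name
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.ISO_8859_1;
        System.out.println(charsetOf("application/json; charset=UTF-8", fallback));
        System.out.println(charsetOf("application/json", fallback));
    }
}
```

Returning a Charset rather than a String means a bad charset name is caught here instead of failing later when the bytes are decoded.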

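Then you are better off reading bytes, because there is no exact correspondence between the number of bytes read and the number of chars read. If the end of the buffer holds the first char of a surrogate pair (two UTF-16 chars that together form one Unicode glyph, symbol, or code point above U+FFFF), I do not know how efficient the underlying "repair" is.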

        BufferedInputStream in = new BufferedInputStream(connection.getInputStream());
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[512];
        while (true) {
            int bytesRead = in.read(buffer);
            if (bytesRead < 0) {
                break;
            }
            if (bytesRead > 0) {
                out.write(buffer, 0, bytesRead);
            }
        }
        return out.toString(charset);

Indeed, it is safe to do this:

sb.append(inputBuffer, 0, charsRead);

(Making the copy was probably an attempt at a fix.)
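The surrogate-pair concern can be checked with a small experiment. The sketch below (class name SurrogateSplitDemo is made up) decodes UTF-8 bytes through a reader with a tiny char array so that a surrogate pair is split across two read() calls, then appends each chunk directly to a StringBuilder as in the line above.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class SurrogateSplitDemo {

    // Reads UTF-8 bytes through a Reader using a deliberately small char
    // array, appending each chunk to a StringBuilder without copying.
    static String decodeInChunks(byte[] utf8, int bufferSize) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[bufferSize];
        try (Reader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8))) {
            int charsRead;
            while ((charsRead = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, charsRead);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F600 is outside the BMP, so it takes two chars in a Java String.
        String text = "x\uD83D\uDE00y";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String decoded = decodeInChunks(utf8, 2);
        System.out.println("round trip intact: " + decoded.equals(text));
    }
}
```

A StringBuilder just concatenates chars, so the two halves of the pair end up adjacent again once both chunks are appended.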

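By the way, char[500] takes almost twice as much memory as byte[512].

I saw in my browser that the site uses gzip compression. That makes sense for text such as JSON. I mimicked it by setting the request header Accept-Encoding: gzip.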

    URL url = new URL("https://www.reddit.com/r/tech/top.json?limit=100");
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    connection.setRequestProperty("Accept-Encoding", "gzip");
    try (InputStream rawIn = connection.getInputStream()) {
        String charset = connection.getContentType().replaceFirst("^.*?(charset=|$)", "");
        if (charset.isEmpty()) {
            charset = "Windows-1252"; // Windows Latin-1
        }
        boolean gzipped = "gzip".equals(connection.getContentEncoding());
        System.out.println("gzip=" + gzipped);

        try (InputStream in = gzipped ? new GZIPInputStream(rawIn)
                : new BufferedInputStream(rawIn)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[512];
            while (true) {
                int bytesRead = in.read(buffer);
                if (bytesRead < 0) {
                    break;
                }
                if (bytesRead > 0) {
                    out.write(buffer, 0, bytesRead);
                }
            }
            return out.toString(charset);
        }
    }

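It might be that for "browsers" that do not accept gzip, the content length of the compressed content is erroneously set in the response. That would be a bug.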
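To see why a content length taken from the compressed body would cut the text short, here is a small in-memory sketch (the class GzipLengthDemo is hypothetical) that compares the size of a gzip-compressed payload with the size of the same payload after decompression.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipLengthDemo {

    // Compresses the given bytes into gzip format.
    static byte[] gzip(byte[] plain) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Closing the gzip stream above wrote the trailer, so the data is complete.
        return out.toByteArray();
    }

    // Decompresses gzip data with the same read loop as in the answer.
    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[512];
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Repetitive, JSON-like sample text (made up for this example).
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2_000; i++) {
            sb.append("{\"kind\":\"t3\",\"score\":").append(i).append("},");
        }
        byte[] plain = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(plain);
        System.out.println("plain=" + plain.length + " gzip=" + packed.length
                + " restored=" + gunzip(packed).length);
    }
}
```

A client that stopped after the compressed length while receiving uncompressed text would see only a fraction of the document, which would fit the symptom in the question.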

Answer 4

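I would suggest using a 3rd-party HTTP client. It could reduce your code to literally a few lines, and you would not have to worry about all those little details. The bottom line is that someone has already written the code you are trying to write, and it works and is well tested. A few suggestions:

Apache Http Client - a well-known and popular HTTP client, but it might be a bit bulky and complicated for a simple case like yours.
Ok Http Client - another well-known HTTP client.
Finally, my favorite (because I wrote it): the MgntUtils open source library, which includes an HTTP client. The Maven artifacts can be found here, the GitHub repository (which includes the library itself as a jar file, the source code, and Javadoc) can be found here, and the Javadoc is here.

The code below uses the MgntUtils library, just to demonstrate how simple the task becomes. (I tested the code and it works like a charm.)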

private static void testHttpClient() {
    HttpClient client = new HttpClient();
    client.setContentType("application/json; charset=utf-8");
    client.setConnectionUrl("https://www.reddit.com/r/tech/top.json?limit=100");
    String content = null;
    try {
        content = client.sendHttpRequest(HttpMethod.GET);
    } catch (IOException e) {
        content = client.getLastResponseMessage() + TextUtils.getStacktrace(e, false);
    }
    System.out.println(content);
}

Tags: java, parsing, json, reader