How can I properly read the sequence of bytes from a hyper::client::Request and print it to the console as a UTF-8 string?

Question

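I am exploring Rust and trying to make a simple HTTP request (using the hyper crate) and print the response body to the console. The response implements std::io::Read. After reading various documentation sources and basic tutorials, I arrived at the following code, which I compile and run with RUST_BACKTRACE=1 cargo run: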

use hyper::client::Client;
use std::io::Read;

pub fn print_html(url: &str) {
    let client = Client::new();
    let req = client.get(url).send();

    match req {
        Ok(mut res) => {
            println!("{}", res.status);

            let mut body = String::new();

            match res.read_to_string(&mut body) {
                Ok(body) => println!("{:?}", body),
                Err(why) => panic!("String conversion failure: {:?}", why)
            }
        },
        Err(why) => panic!("{:?}", why)
    }
}

Expected:

A nice, human-readable HTML body, as delivered by the HTTP server, printed to the console.

Actual:

200 OK
thread '<main>' panicked at 'String conversion failure: Error { repr: Custom(Custom { kind: InvalidData, error: StringError("stream did not contain valid UTF-8") }) }', src/printer.rs:16
stack backtrace:
   1:        0x109e1faeb - std::sys::backtrace::tracing::imp::write::h3800f45f421043b8
   2:        0x109e21565 - std::panicking::default_hook::_$u7b$$u7b$closure$u7d$$u7d$::h0ef6c8db532f55dc
   3:        0x109e2119e - std::panicking::default_hook::hf3839060ccbb8764
   4:        0x109e177f7 - std::panicking::rust_panic_with_hook::h5dd7da6bb3d06020
   5:        0x109e21b26 - std::panicking::begin_panic::h9bf160aee246b9f6
   6:        0x109e18248 - std::panicking::begin_panic_fmt::haf08a9a70a097ee1
   7:        0x109d54378 - libplayground::printer::print_html::hff00c339aa28fde4
   8:        0x109d53d76 - playground::main::h0b7387c23270ba52
   9:        0x109e20d8d - std::panicking::try::call::hbbf4746cba890ca7
  10:        0x109e23fcb - __rust_try
  11:        0x109e23f65 - __rust_maybe_catch_panic
  12:        0x109e20bb1 - std::rt::lang_start::hbcefdc316c2fbd45
  13:        0x109d53da9 - main
error: Process didn't exit successfully: `target/debug/playground` (exit code: 101)

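Thoughts

Since I received 200 OK from the server, I believe I got a valid response from it (I can also prove this empirically by making the same request in a more familiar programming language). Therefore, the error must be caused by me incorrectly converting the byte sequence into an UTF-8 string.

Workaround

I also tried the following workaround, which lets me print the bytes to the console as a series of hex strings, but I know this is fundamentally wrong, because a UTF-8 character can be 1 to 4 bytes long. Trying to convert individual bytes into UTF-8 characters in this example will therefore only work for a very limited (255, to be exact) subset of UTF-8 characters.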

use hyper::client::Client;
use std::io::Read;

pub fn print_html(url: &str) {
    let client = Client::new();
    let req = client.get(url).send();

    match req {
        Ok(res) => {
            println!("{}", res.status);

            for byte in res.bytes() {
                print!("{:x}", byte.unwrap());
            }
        },
        Err(why) => panic!("{:?}", why)
    }
}
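
As a side note on the byte-at-a-time idea above, here is a minimal sketch using only the standard library (the helper name utf8_lengths is made up for illustration). It shows how many bytes a few characters occupy in UTF-8, and what goes wrong when each byte is treated as its own character. Only the 128 ASCII code points encode as a single byte; every other character needs 2 to 4 bytes.

```rust
// Hypothetical helper: pairs each character with its UTF-8 encoded length.
fn utf8_lengths(s: &str) -> Vec<(char, usize)> {
    s.chars().map(|c| (c, c.len_utf8())).collect()
}

fn main() {
    for (c, n) in utf8_lengths("aé€😀") {
        println!("{:?} takes {} byte(s)", c, n);
    }

    // Casting a byte to a char reinterprets it as a code point in
    // U+0000..=U+00FF; it does not decode UTF-8.
    let bytes = "é".as_bytes(); // [0xC3, 0xA9]
    let wrong: String = bytes.iter().map(|&b| b as char).collect();
    println!("{}", wrong); // two characters instead of one
}
```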

Answer 1

We can confirm with the iconv command that the data returned from http://www.google.com is not valid UTF-8:

$ wget http://google.com -O page.html
$ iconv -f utf-8 page.html > /dev/null
iconv: illegal input sequence at position 5591

For some other URLs (for example http://www.reddit.com) the code works fine.

If we assume that most of the data is valid UTF-8, we can use String::from_utf8_lossy to work around the problem:

use hyper::client::Client;
use std::io::Read;

pub fn print_html(url: &str) {
    let client = Client::new();
    let req = client.get(url).send();

    match req {
        Ok(mut res) => {
            println!("{}", res.status);

            // Collect the raw bytes first; no UTF-8 validation happens here.
            let mut body = Vec::new();

            match res.read_to_end(&mut body) {
                // Invalid sequences are replaced with U+FFFD instead of failing.
                Ok(_) => println!("{:?}", String::from_utf8_lossy(&body)),
                Err(why) => panic!("Read failure: {:?}", why),
            }
        }
        Err(why) => panic!("{:?}", why),
    }
}
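
To see what String::from_utf8_lossy does in isolation, here is a small sketch that needs no network access. The helper name decode_lossy and the sample bytes are assumptions chosen for illustration; 0xE9 is "é" in ISO-8859-1 but an incomplete sequence in UTF-8.

```rust
// Hypothetical helper: decodes bytes as UTF-8, replacing invalid
// sequences with U+FFFD (the replacement character).
fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    let body = b"caf\xE9";

    // Strict conversion rejects the input...
    assert!(String::from_utf8(body.to_vec()).is_err());

    // ...while the lossy conversion keeps the valid part.
    println!("{}", decode_lossy(body));
}
```

The output is readable, but the original character is lost, so this is only a reasonable choice when the data is mostly valid UTF-8.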

Note that on success Read::read_to_string and Read::read_to_end return Ok with the number of bytes read, not the data that was read.
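
A minimal sketch of that distinction, using a byte slice as the reader so it runs without hyper (the helper name read_all is an assumption for illustration):

```rust
use std::io::Read;

// Hypothetical helper: returns both the byte count reported by
// read_to_string and the buffer that actually holds the data.
fn read_all<R: Read>(mut reader: R) -> std::io::Result<(usize, String)> {
    let mut body = String::new();
    let n = reader.read_to_string(&mut body)?; // n is a count, not the text
    Ok((n, body))
}

fn main() {
    // &[u8] implements Read, so it can stand in for an HTTP response.
    let (n, body) = read_all(&b"<html></html>"[..]).unwrap();
    println!("read {} bytes: {}", n, body);
}
```

This is why binding the Ok value to a name like body and printing it shows a number rather than the page.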

Answer 2

If you actually look at the headers that Google returns:

HTTP/1.1 200 OK
Date: Fri, 22 Jul 2016 20:45:54 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
P3P: CP="This is not a P3P policy! See https://www.google.com/support/accounts/answer/151657?hl=en for more info."
Server: gws
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Set-Cookie: NID=82=YwAD4Rj09u6gUA8OtQH73BUz6UlNdeRc9Z_iGjyaDqFdRGMdslypu1zsSDWQ4xRJFyEn9-UtR7U6G7HKehoyxvy9HItnDlg8iLsxzlhNcg01luW3_-HWs3l9S3dmHIVh; expires=Sat, 21-Jan-2017 20:45:54 GMT; path=/; domain=.google.ca; HttpOnly
Alternate-Protocol: 443:quic
Alt-Svc: quic=":443"; ma=2592000; v="36,35,34,33,32,31,30,29,28,27,26,25"
Accept-Ranges: none
Vary: Accept-Encoding
Transfer-Encoding: chunked

you can see

Content-Type: text/html; charset=ISO-8859-1

Additionally,

Therefore, the error must be caused by me incorrectly converting the byte sequence into an UTF-8 string.

There is no conversion to UTF-8 happening. read_to_string merely checks that the data is already valid UTF-8.
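
The validation behavior can be reproduced without any HTTP request. In this sketch (the helper name try_read is an assumption), the same text is read once as UTF-8 bytes and once as ISO-8859-1 bytes:

```rust
use std::io::{ErrorKind, Read};

// Hypothetical helper: reads everything into a String and reports only
// the kind of I/O error, which makes the outcome easy to compare.
fn try_read(bytes: &[u8]) -> Result<String, ErrorKind> {
    let mut reader = bytes; // &[u8] implements Read
    let mut body = String::new();
    reader.read_to_string(&mut body).map_err(|e| e.kind())?;
    Ok(body)
}

fn main() {
    // "café" encoded as UTF-8: accepted unchanged.
    println!("{:?}", try_read(b"caf\xC3\xA9"));
    // "café" encoded as ISO-8859-1: rejected, nothing is converted.
    println!("{:?}", try_read(b"caf\xE9"));
}
```

The second call fails with ErrorKind::InvalidData, the same kind that appears in the panic message in the question.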

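Simply put, assuming that an arbitrary HTML page is encoded in UTF-8 is completely incorrect. At best, you have to parse the headers to find the encoding and then convert the data. This is complicated because there's no real definition for what encoding the headers are in.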
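
As a rough sketch of the "parse the headers" step, the function below pulls the charset parameter out of a Content-Type header value that has already been obtained as a string. The function name charset_from_content_type is made up for this example, it is not part of hyper, and it deliberately ignores harder cases such as quoted values containing semicolons or an encoding declared inside the HTML itself.

```rust
// Hypothetical helper: extracts the charset parameter, lowercased,
// from a Content-Type value such as "text/html; charset=ISO-8859-1".
fn charset_from_content_type(value: &str) -> Option<String> {
    value
        .split(';')
        .skip(1) // the first segment is the media type, e.g. "text/html"
        .filter_map(|param| {
            let mut parts = param.splitn(2, '=');
            let name = parts.next()?.trim();
            let val = parts.next()?.trim().trim_matches('"');
            if name.eq_ignore_ascii_case("charset") {
                Some(val.to_ascii_lowercase())
            } else {
                None
            }
        })
        .next()
}

fn main() {
    let header = "text/html; charset=ISO-8859-1";
    println!("{:?}", charset_from_content_type(header));
}
```

When the function returns None, the caller still has to pick a fallback encoding, which is part of why this is harder than it looks.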

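Once you have found the correct encoding, you can use a crate such as encoding to properly transform the result into UTF-8, if the result is even text! Remember that HTTP can return binary files such as images.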
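
For the specific case shown in the headers above, ISO-8859-1, the conversion can be sketched with the standard library alone, because every ISO-8859-1 byte value equals the Unicode code point of the character it represents. This is a hand-rolled stand-in, not the API of the encoding crate, and the helper name latin1_to_string is an assumption. Other encodings need a real decoding library.

```rust
// Hypothetical helper: decodes ISO-8859-1 (Latin-1) bytes into a String.
// Valid only for this one encoding: each byte maps to U+0000..=U+00FF.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

fn main() {
    // "café" as a server declaring charset=ISO-8859-1 would send it.
    let body = b"caf\xE9";
    let text = latin1_to_string(body);

    // The resulting String is valid UTF-8 and prints correctly.
    println!("{}", text);
}
```

The decoded string is one byte longer than the input, since "é" takes two bytes in UTF-8.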


