How can I read a non-UTF8 file line by line in Rust?

I am trying to read a file one line at a time in Rust, and started with the approach suggested elsewhere:

use std::error::Error;
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> Result<(), Box<dyn Error>> {
    let file = File::open("countries.txt")?;
    let reader = BufReader::new(file);
    for line in reader.lines() {
        match line {
            Ok(line) => println!("Ok: {}", line),
            Err(error) => println!("Err: {}", error),
        }
    }
    return Ok(());
}

However, my file is not UTF-8. Python's chardet.universaldetector library tells me it is ISO-8859-1:

Cuba
Curaçao
Cyprus
Czech Republic
Côte d'Ivoire

Out of the box, Rust cannot interpret the lines that contain non-UTF8 characters:

$ ./target/release/main1 
Ok: Cuba
Err: stream did not contain valid UTF-8
Ok: Cyprus
Ok: Czech Republic
Err: stream did not contain valid UTF-8

So I tried the encoding_rs_io library. I am using Windows-1252 here rather than ISO-8859-1, but it seems to work for this data:

use std::error::Error;
use std::fs::File;
use std::io::Read;

use encoding_rs::WINDOWS_1252;
use encoding_rs_io::DecodeReaderBytesBuilder;

fn main() -> Result<(), Box<dyn Error>> {
    let file = File::open("countries.txt")?;
    let mut reader = DecodeReaderBytesBuilder::new().encoding(Some(WINDOWS_1252)).build(file);
    let mut buffer = vec![];
    reader.read_to_end(&mut buffer)?;
    println!("{}", String::from_utf8(buffer).unwrap());
    return Ok(());
}

This reads the non-UTF8 characters successfully:

$ ./target/release/main2 
Cuba
Curaçao
Cyprus
Czech Republic
Côte d'Ivoire

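However, it has no lines() method, so I cannot read one line at a time. I noticed that the ripgrep project uses this library to decode non-UTF8 files, and I have stepped through its source code in a debugger. As far as I can tell, it does its own manual CR/LF detection.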

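Surely the task of reading a non-UTF8 file one line at a time in Rust has already been solved. Do I really need to reinvent the wheel? Any help is much appreciated!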

DecodeReaderBytes implements io::Read, so you should be able to wrap it in a std::io::BufReader and use its lines() method (which comes from the BufRead trait):

use std::error::Error;
use std::fs::File;
use std::io::{BufRead, BufReader};

use encoding_rs::WINDOWS_1252;
use encoding_rs_io::DecodeReaderBytesBuilder;

fn main() -> Result<(), Box<dyn Error>> {
    let file = File::open("countries.txt")?;
    // DecodeReaderBytes transcodes the input to UTF-8 on the fly,
    // so BufReader::lines() only ever sees valid UTF-8.
    let reader = BufReader::new(
        DecodeReaderBytesBuilder::new()
            .encoding(Some(WINDOWS_1252))
            .build(file),
    );
    for line in reader.lines() {
        // Each item is an io::Result<String>; `?` propagates I/O errors.
        println!("{}", line?);
    }
    Ok(())
}
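
If pulling in a crate is not an option, the same idea can be sketched with the standard library alone: read each line as raw bytes with BufRead::read_until, then decode it yourself. The sketch below assumes the data is strict ISO-8859-1, where every byte maps to the Unicode code point of the same value. That is not identical to Windows-1252, which differs in the 0x80-0x9F range. The names latin1_to_string and read_latin1_lines are made up for illustration, and an in-memory Cursor stands in for countries.txt.

```rust
use std::io::{self, BufRead, Cursor};

/// Decodes an ISO-8859-1 (Latin-1) byte slice into a String.
/// Each Latin-1 byte maps to the Unicode code point with the same value.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Hypothetical helper: reads `reader` line by line as raw bytes and
/// decodes every line as Latin-1, stripping the trailing "\n" or "\r\n".
fn read_latin1_lines<R: BufRead>(mut reader: R) -> io::Result<Vec<String>> {
    let mut lines = Vec::new();
    let mut buf = Vec::new();
    loop {
        buf.clear();
        // read_until works on raw bytes, so it never fails on invalid UTF-8.
        if reader.read_until(b'\n', &mut buf)? == 0 {
            break; // end of input
        }
        if buf.last() == Some(&b'\n') {
            buf.pop();
            if buf.last() == Some(&b'\r') {
                buf.pop();
            }
        }
        lines.push(latin1_to_string(&buf));
    }
    Ok(lines)
}

fn main() -> io::Result<()> {
    // Stand-in for countries.txt: 0xE7 is 'ç' and 0xF4 is 'ô' in ISO-8859-1.
    let data: &[u8] = b"Cuba\nCura\xE7ao\nC\xF4te d'Ivoire\n";
    for line in read_latin1_lines(Cursor::new(data))? {
        println!("{}", line);
    }
    Ok(())
}
```

For a real file, pass BufReader::new(File::open("countries.txt")?) instead of the Cursor. For any encoding other than plain Latin-1, the encoding_rs_io approach above is the safer choice, since this byte-to-char mapping is only valid for ISO-8859-1.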