How can I read a non-UTF8 file line by line in Rust
I am trying to read a file one line at a time in Rust, and started out with the suggested approach:
use std::error::Error;
use std::fs::File;
use std::io::{BufRead, BufReader};
fn main() -> Result<(), Box<dyn Error>> {
    let file = File::open("countries.txt")?;
    let reader = BufReader::new(file);
    for line in reader.lines() {
        match line {
            Ok(line) => println!("Ok: {}", line),
            Err(error) => println!("Err: {}", error),
        }
    }
    return Ok(());
}
However, my file is not UTF-8. The Python chardet.universaldetector library tells me it is ISO-8859-1:
Cuba
Curaçao
Cyprus
Czech Republic
Côte d'Ivoire
Out of the box, Rust cannot interpret the lines that contain non-UTF8 characters:
$ ./target/release/main1
Ok: Cuba
Err: stream did not contain valid UTF-8
Ok: Cyprus
Ok: Czech Republic
Err: stream did not contain valid UTF-8
So I tried the encoding_rs_io library. I am using Windows 1252 here rather than ISO-8859-1, but it seems to work for this data:
use std::error::Error;
use std::fs::File;
use std::io::Read;
use encoding_rs::WINDOWS_1252;
use encoding_rs_io::DecodeReaderBytesBuilder;
fn main() -> Result<(), Box<dyn Error>> {
    let file = File::open("countries.txt")?;
    let mut reader = DecodeReaderBytesBuilder::new().encoding(Some(WINDOWS_1252)).build(file);
    let mut buffer = vec![];
    reader.read_to_end(&mut buffer)?;
    println!("{}", String::from_utf8(buffer).unwrap());
    return Ok(());
}
The non-ASCII characters are now read successfully:
$ ./target/release/main2
Cuba
Curaçao
Cyprus
Czech Republic
Côte d'Ivoire
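However, it has no lines() method, so I cannot read one line at a time. I noticed that the ripgrep project uses this library to decode non-UTF8 files, and I have stepped into its source code in a debugger. As far as I can tell, it does its own manual CR/LF detection.
Surely the task of reading a non-UTF8 file one line at a time in Rust has already been solved. Do I really need to reinvent the wheel? Any help is much appreciated!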
DecodeReaderBytes implements io::Read, so you should be able to wrap it in a std::io::BufReader and use its lines method:
use std::error::Error;
use std::fs::File;
use std::io::{BufRead, BufReader};
use encoding_rs::WINDOWS_1252;
use encoding_rs_io::DecodeReaderBytesBuilder;
fn main() -> Result<(), Box<dyn Error>> {
    let file = File::open("countries.txt")?;
    // The decoder transcodes Windows-1252 to UTF-8; BufReader supplies lines().
    let reader = BufReader::new(
        DecodeReaderBytesBuilder::new()
            .encoding(Some(WINDOWS_1252))
            .build(file));
    for line in reader.lines() {
        // Each item is an io::Result<String>, so propagate errors with `?`.
        println!("{}", line?);
    }
    return Ok(());
}
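If you would rather avoid the extra crates and the file really is ISO-8859-1, a std-only alternative is possible, because every ISO-8859-1 byte value equals the Unicode code point of its character. The sketch below is only an illustration of that idea, not what encoding_rs_io does internally; the names decode_latin1 and read_latin1_lines are made up for this example. It splits on b'\n' at the byte level with BufRead::read_until and strips a trailing "\n" or "\r\n", as lines() does. Note that it is strict Latin-1: bytes 0x80 to 0x9F come out as C1 control characters, whereas Windows-1252 maps most of them to printable characters such as the euro sign.

```rust
use std::io::{self, BufRead};

/// Decodes ISO-8859-1 (Latin-1) bytes into a String.
/// Each Latin-1 byte value equals the Unicode code point of its character,
/// so a plain `u8 -> char` conversion is a complete decoder for this encoding.
/// (Hypothetical helper name, chosen for this example.)
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Reads `reader` line by line, splitting on b'\n' at the byte level and
/// decoding each line as Latin-1. A trailing "\n" or "\r\n" is stripped.
/// (Hypothetical helper name, chosen for this example.)
fn read_latin1_lines<R: BufRead>(mut reader: R) -> io::Result<Vec<String>> {
    let mut lines = Vec::new();
    let mut buf = Vec::new();
    loop {
        buf.clear();
        // read_until appends bytes up to and including the delimiter;
        // a return value of 0 means end of input.
        if reader.read_until(b'\n', &mut buf)? == 0 {
            break;
        }
        if buf.last() == Some(&b'\n') {
            buf.pop();
            if buf.last() == Some(&b'\r') {
                buf.pop();
            }
        }
        lines.push(decode_latin1(&buf));
    }
    Ok(lines)
}
```

To use it on the file from the question, pass BufReader::new(File::open("countries.txt")?) to read_latin1_lines. For any encoding other than ISO-8859-1, the encoding_rs_io approach above remains the safer choice.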