BinaryReader reading from a FileStream that loads in chunks
I am using the following code to read values from a huge file (> 10 GB):
FileStream fs = new FileStream(fileName, FileMode.Open);
BinaryReader br = new BinaryReader(fs);
int count = br.ReadInt32();
List<long> numbers = new List<long>(count);
for (int i = count; i > 0; i--)
{
    numbers.Add(br.ReadInt64());
}
Unfortunately, the read speed from my SSD is stuck at a few MB/s. My guess is that the limiting factor is the SSD's IOPS, so it would probably be better to read the file in larger chunks.
Questions
Does the FileStream in my code really read only 8 bytes from the file each time the BinaryReader calls ReadInt64()?
If so, is there a transparent way to give the BinaryReader a stream that reads larger chunks from the file in order to speed up the process?
Test code
Here is a minimal example that creates a test file and measures read performance.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;

namespace TestWriteRead
{
    class Program
    {
        static void Main(string[] args)
        {
            System.IO.File.Delete("test");
            CreateTestFile("test", 1000000000);

            Stopwatch stopwatch = new Stopwatch();
            stopwatch.Start();
            IEnumerable<long> test = Read("test");
            stopwatch.Stop();
            Console.WriteLine("File loaded within " + stopwatch.ElapsedMilliseconds + "ms");
        }

        private static void CreateTestFile(string filename, int count)
        {
            FileStream fs = new FileStream(filename, FileMode.CreateNew);
            BinaryWriter bw = new BinaryWriter(fs);
            bw.Write(count);
            for (int i = 0; i < count; i++)
            {
                long value = i;
                bw.Write(value);
            }
            fs.Close();
        }

        private static IEnumerable<long> Read(string filename)
        {
            FileStream fs = new FileStream(filename, FileMode.Open);
            BinaryReader br = new BinaryReader(fs);
            int count = br.ReadInt32();
            List<long> values = new List<long>(count);
            for (int i = 0; i < count; i++)
            {
                long value = br.ReadInt64();
                values.Add(value);
            }
            fs.Close();
            return values;
        }
    }
}
You can use a BufferedStream to increase the read buffer size.
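For example (a minimal sketch of the idea; the 1 MB buffer size is my assumption and should be tuned, and the file layout is the Int32 count followed by Int64 values from the question):

using (var fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
using (var bs = new BufferedStream(fs, 1024 * 1024)) // 1 MB buffer instead of FileStream's 4 KB default
using (var br = new BinaryReader(bs))
{
    int count = br.ReadInt32();
    var numbers = new List<long>(count);
    for (int i = 0; i < count; i++)
    {
        numbers.Add(br.ReadInt64()); // each 8-byte read is now served from the 1 MB buffer
    }
}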
You should also configure the stream with FileOptions.SequentialScan to indicate that you will read the stream from beginning to end. It should improve the speed significantly.
Indicates that the file is to be accessed sequentially from beginning to end. The system can use this as a hint to optimize file caching. If an application moves the file pointer for random access, optimum caching may not occur; however, correct operation is still guaranteed.
using (var fs = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 8192,
                               FileOptions.SequentialScan))
{
    var br = new BinaryReader(fs);
    var count = br.ReadInt32();
    var numbers = new List<long>();
    for (int i = count; i > 0; i--)
    {
        numbers.Add(br.ReadInt64());
    }
}
Try reading in chunks instead:
using (var fs = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 8192,
                               FileOptions.SequentialScan))
{
    var br = new BinaryReader(fs);
    var numbersLeft = br.ReadInt32(); // the count was written as an Int32, so read it as one
    byte[] buffer = new byte[8192];
    var bufferOffset = 0;
    // use a long: sizeof(long) * numbersLeft overflows int for files larger than 2 GB
    long bytesLeftToReceive = sizeof(long) * (long)numbersLeft;
    var numbers = new List<long>(numbersLeft);
    while (true)
    {
        // Do not read more than possible
        var bytesToRead = (int)Math.Min(bytesLeftToReceive, buffer.Length - bufferOffset);
        if (bytesToRead == 0)
            break;

        var bytesRead = fs.Read(buffer, bufferOffset, bytesToRead);
        if (bytesRead == 0)
            break; //TODO: Continue to read if file is not ready?

        //move forward in read counter
        bytesLeftToReceive -= bytesRead;
        bytesRead += bufferOffset; //include bytes from previous read.

        //decide how many complete numbers we got
        var numbersToCrunch = bytesRead / sizeof(long);

        //crunch them
        for (int i = 0; i < numbersToCrunch; i++)
        {
            numbers.Add(BitConverter.ToInt64(buffer, i * sizeof(long)));
        }

        // move the last incomplete number to the beginning of the buffer.
        var remainder = bytesRead % sizeof(long);
        Buffer.BlockCopy(buffer, bytesRead - remainder, buffer, 0, remainder);
        bufferOffset = remainder;
    }
}
Update, in reply to a comment:
May I know what's the reason that manual reading is faster than the other one?
I don't know how BinaryReader is actually implemented, so this is just an assumption.
The actual read from disk isn't the expensive part; the expensive part is moving the read head to the correct position on the disk.
Since your application isn't the only one reading from the drive, the disk has to reposition itself every time an application requests a read.
So if BinaryReader just reads the requested int, it has to wait for the disk on every read (if another application has read in between).
When I read the larger buffer directly (which is faster), I can process many more integers without having to wait for the disk between reads.
Caching will of course speed things up, which is why it's "just" three times faster.
(To future readers: if anything above is incorrect, please correct me.)
In theory, memory mapped files should help here: you could map the file into memory in a few very large chunks. I'm not sure how relevant that is when using an SSD, though.
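A minimal sketch of that idea, assuming the file layout from the question (an Int32 count followed by Int64 values) and a 64-bit process so a view over the whole >10 GB file is possible:

// requires using System.IO.MemoryMappedFiles;
using (var mmf = MemoryMappedFile.CreateFromFile(fileName, FileMode.Open))
using (var accessor = mmf.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read)) // size 0 = map the whole file
{
    int count = accessor.ReadInt32(0);
    var numbers = new List<long>(count);
    long offset = sizeof(int); // the values start right after the Int32 count
    for (int i = 0; i < count; i++)
    {
        numbers.Add(accessor.ReadInt64(offset));
        offset += sizeof(long);
    }
}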