Parsing files over 2.15 GB in Java using Kaitai Struct

Question

I am using Kaitai Struct to parse large PCAP files in Java. Whenever the file size exceeds Integer.MAX_VALUE bytes, I get an IllegalArgumentException caused by the size limit of the underlying ByteBuffer.

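I haven't found any reference to this problem elsewhere, which leads me to believe that it is not a library limitation but a mistake in the way I'm using the library.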

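Since the problem is caused by trying to map the whole file into a single ByteBuffer, I would think the solution is to map only the first region of the file and, as the data is consumed, map again while skipping the data that has already been parsed.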

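Because the mapping is done inside the Kaitai Struct Runtime library, this would mean writing my own class that extends KaitaiStream and replacing the auto-generated fromFile(...) method, which doesn't really seem like the right approach.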

The auto-generated method that parses a file into the Pcap class is:

public static Pcap fromFile(String fileName) throws IOException {
  return new Pcap(new ByteBufferKaitaiStream(fileName));
}

The ByteBufferKaitaiStream provided by the Kaitai Struct Runtime library is backed by a ByteBuffer:

private final FileChannel fc;
private final ByteBuffer bb;

public ByteBufferKaitaiStream(String fileName) throws IOException {
    fc = FileChannel.open(Paths.get(fileName), StandardOpenOption.READ);
    bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
}

This in turn is limited by the maximum size of a ByteBuffer.
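The limit can be reproduced without a multi-gigabyte file, because FileChannel.map rejects any requested size above Integer.MAX_VALUE before it looks at the file. The following is a minimal sketch; the class name MapLimitDemo and the helper mapIsRejected are made up for this illustration and are not part of Kaitai Struct.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class, not part of the Kaitai Struct runtime.
final class MapLimitDemo {

    /** Returns true if mapping `size` bytes is rejected with IllegalArgumentException. */
    static boolean mapIsRejected(Path file, long size) throws IOException {
        try (FileChannel fc = FileChannel.open(file, StandardOpenOption.READ)) {
            // Same call that ByteBufferKaitaiStream makes with fc.size() as the length.
            fc.map(FileChannel.MapMode.READ_ONLY, 0, size);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("map-limit", ".bin");
        try {
            Files.write(tmp, new byte[16]);
            // A small mapping is accepted.
            System.out.println(mapIsRejected(tmp, 16));
            // One byte more than Integer.MAX_VALUE is refused regardless of file size.
            System.out.println(mapIsRejected(tmp, Integer.MAX_VALUE + 1L));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```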

Am I missing some obvious workaround? Is this really a limitation of the Kaitai Struct implementation for Java?

Answer 1

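This library provides a ByteBuffer implementation that uses long offsets. I haven't tried this approach, but it looks promising. See the section on mapping files larger than 2 GB: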

http://www.kdgregory.com/index.php?page=java.byteBuffer

public int getInt(long index)
{
    return buffer(index).getInt();
}

private ByteBuffer buffer(long index)
{
    ByteBuffer buf = _buffers[(int)(index / _segmentSize)];
    buf.position((int)(index % _segmentSize));
    return buf;
}

public MappedFileBuffer(File file, int segmentSize, boolean readWrite)
throws IOException
{
    if (segmentSize > MAX_SEGMENT_SIZE)
        throw new IllegalArgumentException(
                "segment size too large (max " + MAX_SEGMENT_SIZE + "): " + segmentSize);

    _segmentSize = segmentSize;
    _fileSize = file.length();

    RandomAccessFile mappedFile = null;
    try
    {
        String mode = readWrite ? "rw" : "r";
        MapMode mapMode = readWrite ? MapMode.READ_WRITE : MapMode.READ_ONLY;

        mappedFile = new RandomAccessFile(file, mode);
        FileChannel channel = mappedFile.getChannel();

        _buffers = new MappedByteBuffer[(int)(_fileSize / segmentSize) + 1];
        int bufIdx = 0;
        for (long offset = 0 ; offset < _fileSize ; offset += segmentSize)
        {
            long remainingFileSize = _fileSize - offset;
            long thisSegmentSize = Math.min(2L * segmentSize, remainingFileSize);
            _buffers[bufIdx++] = channel.map(mapMode, offset, thisSegmentSize);
        }
    }
    finally
    {
        // close quietly
        if (mappedFile != null)
        {
            try
            {
                mappedFile.close();
            }
            catch (IOException ignored) { /* */ }
        }
    }
}
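One detail in the constructor above is easy to miss: each mapping is up to twice the segment size, so neighbouring buffers overlap and a multi-byte value that starts near the end of one segment can still be read from a single buffer. Here is a stripped-down, read-only sketch of the same idea. The class SegmentedMappingSketch is hypothetical, uses an absolute getInt instead of repositioning the buffer, and assumes no single read is wider than one segment.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical, simplified variant of the segmented mapping shown above.
final class SegmentedMappingSketch {
    private final MappedByteBuffer[] buffers;
    private final int segmentSize;

    SegmentedMappingSketch(Path file, int segmentSize) throws IOException {
        this.segmentSize = segmentSize;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long fileSize = ch.size();
            buffers = new MappedByteBuffer[(int) (fileSize / segmentSize) + 1];
            int i = 0;
            for (long offset = 0; offset < fileSize; offset += segmentSize) {
                // Map up to two segments so this buffer overlaps the next one.
                long len = Math.min(2L * segmentSize, fileSize - offset);
                buffers[i++] = ch.map(FileChannel.MapMode.READ_ONLY, offset, len);
            }
        }
        // The mappings stay valid after the channel is closed.
    }

    /** Reads a big-endian int at a file offset addressed with a long. */
    int getInt(long index) {
        ByteBuffer buf = buffers[(int) (index / segmentSize)];
        return buf.getInt((int) (index % segmentSize));
    }
}
```

With a real file the segment size would be large (tens or hundreds of megabytes); a tiny segment size is only useful for checking that reads across a segment boundary work.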

Answer 2

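There are two separate issues here:

First, running Pcap.fromFile() on a large file is generally not very efficient, because you end up with the entire file parsed into in-memory arrays at once. An example of how to avoid that is given in kaitai_struct/issues/255. The basic idea is that you want to control how each packet is read, and then dispose of each packet once you have parsed or accounted for it in some way.

Second, a memory-mapped buffer in Java is limited to 2 GB. To mitigate that, you can use the alternative RandomAccessFile-based KaitaiStream implementation, RandomAccessFileKaitaiStream. It might be slower, but it should avoid the 2 GB problem.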
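To make the first point concrete, here is a plain-Java sketch of packet-at-a-time processing. It is not the code from kaitai_struct/issues/255 and does not use Kaitai Struct at all; it only illustrates the read, handle, discard pattern. It assumes the classic pcap layout (24-byte global header, 16-byte record headers) written little-endian with magic 0xa1b2c3d4, and the names PcapStreamSketch, PacketHandler and forEachPacket are made up for this example.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: streams classic little-endian pcap records one at a time.
final class PcapStreamSketch {

    /** Hypothetical callback that receives one packet's captured bytes. */
    interface PacketHandler {
        void onPacket(long tsSec, long tsUsec, byte[] data);
    }

    /** Hands every packet to the handler and returns the number of packets seen. */
    static long forEachPacket(InputStream raw, PacketHandler handler) throws IOException {
        DataInputStream in = new DataInputStream(raw);

        // Global header: magic, version, timezone, sigfigs, snaplen, link type.
        byte[] global = new byte[24];
        in.readFully(global);
        int magic = ByteBuffer.wrap(global).order(ByteOrder.LITTLE_ENDIAN).getInt(0);
        if (magic != 0xa1b2c3d4) {
            throw new IOException("unsupported pcap magic: " + Integer.toHexString(magic));
        }

        byte[] rec = new byte[16];
        long count = 0;
        while (true) {
            int first = in.read();
            if (first < 0) {
                return count; // clean end of file between two records
            }
            rec[0] = (byte) first;
            in.readFully(rec, 1, 15);
            ByteBuffer h = ByteBuffer.wrap(rec).order(ByteOrder.LITTLE_ENDIAN);
            long tsSec = h.getInt(0) & 0xffffffffL;
            long tsUsec = h.getInt(4) & 0xffffffffL;
            int inclLen = h.getInt(8); // number of bytes actually captured
            if (inclLen < 0) {
                throw new IOException("invalid captured length: " + inclLen);
            }
            byte[] data = new byte[inclLen];
            in.readFully(data);
            // Only this one packet is held in memory; it can be collected after the call.
            handler.onPacket(tsSec, tsUsec, data);
            count++;
        }
    }
}
```

For a file on disk the stream could be a BufferedInputStream wrapped around a FileInputStream. Because the file is read sequentially and never mapped, its total size does not matter.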

java

kaitai-struct