Trying to extract a 600 MB tgz with Ruby gives an out-of-integer-range error

I'm trying to extract a tgz file using the following code:

tar_extract.each do |entry|
  entry_filename = File.basename(entry.full_name)
  next if entry.directory? # don't unzip directories
  next if !entry.file? # if it's not a file skip  
  next if entry.full_name.starts_with?('/') # another check

  file_path = File.join(working_directory, entry_filename)
  puts "Writing file: #{file_path}"

  File.open(file_path, 'wb') do |f|
    f.write(entry.read)
  end

  bytes = File.size(file_path)

  puts "Successfully wrote file with #{bytes} bytes"
end

tar_extract.close

This code usually works, but when a file inside the TGZ is too large, I get an integer out of range error.

Writing file: /files/working_dir/test1.tar.gz  
Successfully wrote file with 244704472 bytes 

Writing file: /files/working_dir/test2.sql
RangeError: integer 2556143960 too big to convert to `int'
from /usr/local/rvm/rubies/ruby-2.1.1/lib/ruby/site_ruby/2.1.0/rubygems/package/tar_reader/entry.rb:126:in `read'

I'm not sure what else I should try.

Looking at the Ruby source, this is the code block that raises the error:

  ##
  # Reads +len+ bytes from the tar file entry, or the rest of the entry if
  # nil

  def read(len = nil)
    check_closed

    return nil if @read >= @header.size

    len ||= @header.size - @read
    max_read = [len, @header.size - @read].min

    ret = @io.read max_read
    @read += ret.size

    ret
  end

You can probably work around this by changing the following:

  File.open(file_path, 'wb') do |f|
    f.write(entry.read)
  end

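into a loop in which you call entry.read with an argument giving the maximum number of bytes to fetch in that iteration. You will probably have to split it into two calls, because entry.read can return nil, which indicates there is no more data to process.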
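A minimal sketch of such a loop. The helper name copy_entry_in_chunks and the 16,384-byte chunk size are assumptions made for illustration, and the small in-memory archive built with Gem::Package::TarWriter is only there so the snippet runs on its own. Reading into a variable first and testing it for nil is the "two calls" split described above.

```ruby
require 'rubygems/package'
require 'stringio'

# Hypothetical helper (not part of RubyGems): copies one tar entry to +out+
# in fixed-size chunks, so no single read asks for more than chunk_size bytes.
def copy_entry_in_chunks(entry, out, chunk_size = 16_384)
  written = 0
  # entry.read(n) returns nil once the entry is exhausted, which ends the loop.
  while (chunk = entry.read(chunk_size))
    written += out.write(chunk)
  end
  written
end

# Build a tiny in-memory tar archive so the sketch is runnable as-is.
tar_io = StringIO.new(String.new) # binary (ASCII-8BIT) buffer
Gem::Package::TarWriter.new(tar_io) do |tar|
  tar.add_file_simple('demo.bin', 0o644, 50_000) { |io| io.write('x' * 50_000) }
end
tar_io.rewind

copied = StringIO.new
Gem::Package::TarReader.new(tar_io) do |reader|
  reader.each do |entry|
    next unless entry.file?
    copy_entry_in_chunks(entry, copied)
  end
end
```

In the original code, out would be the File opened with 'wb' rather than a StringIO.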

With Joe's guidance, I found the answer.

I changed the File block to:

File.open(file_path, 'wb') do |f|
  while !entry.eof?
    f.write(entry.read(16000)) # 16 KB
  end
end

I chose 16 KB because I ran a bunch of benchmarks:

b = Benchmark.measure do
  File.open(file_path, 'wb') do |f|
    while !entry.eof?
      f.write(entry.read(16000)) # 16 KB
    end
  end
end

bytes = File.size(file_path)
puts("Successfully wrote file with #{bytes} bytes in #{b.real}")

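After some research, it seems that every disk has its own optimal block size. I used two files for benchmarking, one of 211 MB and one of 6.6 GB. The results are below; 16KB - 64KB turned out to be the best range for my disk.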

2GB // 2047483648

Successfully wrote file with 7021620216 bytes in 60.360527059

Successfully wrote file with 220613778 bytes in 2.084798686

1GB // 1073741824

Successfully wrote file with 7021620216 bytes in 42.345642806
Successfully wrote file with 7021620216 bytes in 48.941375145
Successfully wrote file with 7021620216 bytes in 51.501044608
Successfully wrote file with 7021620216 bytes in 58.81474911

Successfully wrote file with 220613778 bytes in 1.57968424
Successfully wrote file with 220613778 bytes in 2.28171993
Successfully wrote file with 220613778 bytes in 5.905203041
Successfully wrote file with 220613778 bytes in 16.944126945

4KB // 4000

Successfully wrote file with 7021620216 bytes in 43.39409191
Successfully wrote file with 7021620216 bytes in 44.572620161
Successfully wrote file with 7021620216 bytes in 48.510513964
Successfully wrote file with 7021620216 bytes in 53.839022034

Successfully wrote file with 220613778 bytes in 1.982647292
Successfully wrote file with 220613778 bytes in 2.071772595
Successfully wrote file with 220613778 bytes in 2.132004983
Successfully wrote file with 220613778 bytes in 2.221654993

8KB // 8000

Successfully wrote file with 7021620216 bytes in 41.851550514
Successfully wrote file with 7021620216 bytes in 45.611952667
Successfully wrote file with 7021620216 bytes in 50.068614205
Successfully wrote file with 7021620216 bytes in 50.726276706

Successfully wrote file with 220613778 bytes in 1.941246687
Successfully wrote file with 220613778 bytes in 2.456356439
Successfully wrote file with 220613778 bytes in 2.56323527
Successfully wrote file with 220613778 bytes in 3.756049832

16KB // 16000

Successfully wrote file with 7021620216 bytes in 36.929413152
Successfully wrote file with 7021620216 bytes in 36.486866289
Successfully wrote file with 7021620216 bytes in 36.743103326
Successfully wrote file with 7021620216 bytes in 37.019910405

Successfully wrote file with 220613778 bytes in 1.504792162
Successfully wrote file with 220613778 bytes in 1.620161067
Successfully wrote file with 220613778 bytes in 1.622070414
Successfully wrote file with 220613778 bytes in 1.698627821


32KB // 32000

Successfully wrote file with 7021620216 bytes in 35.802759912
Successfully wrote file with 7021620216 bytes in 38.775857377
Successfully wrote file with 7021620216 bytes in 39.116311496
Successfully wrote file with 7021620216 bytes in 39.126005469

Successfully wrote file with 220613778 bytes in 1.696821094
Successfully wrote file with 220613778 bytes in 1.773727215
Successfully wrote file with 220613778 bytes in 4.023144931
Successfully wrote file with 220613778 bytes in 4.08615266


64KB // 64000

Successfully wrote file with 7021620216 bytes in 36.732343382
Successfully wrote file with 7021620216 bytes in 37.914365658
Successfully wrote file with 7021620216 bytes in 38.336098907
Successfully wrote file with 7021620216 bytes in 39.146334479

Successfully wrote file with 220613778 bytes in 1.662487522
Successfully wrote file with 220613778 bytes in 1.674177939
Successfully wrote file with 220613778 bytes in 1.745556917
Successfully wrote file with 220613778 bytes in 1.784492717
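
A comparison like the one above can be scripted rather than run by hand. This is only a sketch under stated assumptions: timed_chunked_copy is a hypothetical helper, the 1 MB in-memory payload stands in for a real tar entry, and the timings it prints will not be comparable to the disk-bound numbers listed here.

```ruby
require 'benchmark'
require 'stringio'
require 'tempfile'

# Hypothetical helper: copies everything from +src+ to the file at +path+ in
# chunks of +chunk_size+ bytes and returns the elapsed wall-clock seconds.
def timed_chunked_copy(src, path, chunk_size)
  Benchmark.realtime do
    File.open(path, 'wb') do |f|
      # IO#read(n) returns nil at end of input, which ends the loop.
      while (chunk = src.read(chunk_size))
        f.write(chunk)
      end
    end
  end
end

# Stand-in data (assumption); a real run would read from a tar entry instead.
payload = 'x' * 1_000_000
chunk_sizes = [4_000, 8_000, 16_000, 32_000, 64_000]

results = {}
Tempfile.create('chunk_bench') do |tmp|
  chunk_sizes.each do |size|
    seconds = timed_chunked_copy(StringIO.new(payload), tmp.path, size)
    results[size] = { seconds: seconds, bytes: File.size(tmp.path) }
  end
end

results.each do |size, r|
  puts format('%6d bytes per read: wrote %d bytes in %.6f s', size, r[:bytes], r[:seconds])
end
```

Running each chunk size several times, as in the results above, helps separate real differences from noise such as disk caching.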