dd: reading binary file as blocks of size N returned less data than N

Question

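I need to process a large binary file in segments. Conceptually this is similar to split, but instead of writing each segment to a file, I need to take that segment and send it as the input of another process. I thought I could use dd to read/write the file in chunks, but the results aren't at all what I expected. For example, if I try: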

dd if=some_big_file bs=1M |
    while : ; do
        dd bs=1M count=1 | processor
    done

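...the output sizes are actually 131,072 bytes and not 1,048,576.

Can anyone tell me why I'm not seeing the output blocked into 1M chunks, and how I can better accomplish what I'm trying to do?

Thanks.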

Answer 1

First, you don't need the first dd. cat file | while or done < file would do the trick as well.

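dd bs=1M count=1 may return less than 1M, because count limits the number of read operations and a single read from a pipe can return fewer bytes than requested. See "When is dd suitable for copying data? (or, when are read() and write() partial)".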
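The short read is easy to observe directly. The following is a minimal sketch, assuming a system with /dev/zero and a head that supports -c; the variable names are only illustrative, and the number reported depends on the system's pipe buffering, so no particular value should be expected.

```shell
# Ask dd for a single 1 MiB block from a pipe and count what actually arrives.
# The trailing cat drains the rest of the pipe so head can finish cleanly.
requested=1048576
received=$(head -c 4194304 /dev/zero |
    { dd bs="$requested" count=1 2>/dev/null; cat >/dev/null; } |
    wc -c | tr -d ' ')
echo "requested $requested bytes, dd count=1 delivered $received"
```

If the delivered count is smaller than the requested one, dd performed one read and stopped, which is the behaviour seen in the question.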

Instead of dd count=…, use head with the (non-POSIX) option -c ….

file=some_big_file
(( m = 1024 ** 2 ))
(( blocks = ($(stat -c %s "$file") + m - 1) / m ))
for ((i=0; i<blocks; ++i)); do
  head -c "$m" | processor
done < "$file"

Or, with POSIX tools for the conversion (the read -N loop is still bash), but very inefficient:

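# Each input byte becomes 4 characters (\ooo), so 4M characters correspond to 1M bytes.
(( octM = 4 * 1024 * 1024 ))
# od prints every byte as " ooo"; drop the newlines, then turn each space into a backslash.
someCommand | od -v -to1 -An | tr -d '\n' | tr ' ' '\\' |
# read -N returns non-zero on a short final block, so also accept a non-empty remainder.
while IFS= read -rN "$octM" block || [[ -n $block ]]; do
  printf %b "$block" | processor
done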

Answer 2

According to dd's manual:

bs=bytes

[...] if no data-transforming conv option is specified, input is copied to the output as soon as it's read, even if it is smaller than the block size.

So try dd iflag=fullblock:

fullblock

Accumulate full blocks from input. The read system call may return early if a full block is not available. When that happens, continue calling read to fill the remainder of the block. This flag can be used only with iflag. This flag is useful with pipes for example as they may return short reads. In that case, this flag is needed to ensure that a count= argument is interpreted as a block count rather than a count of read operations.
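Applied to the loop from the question, this could look like the sketch below. It assumes GNU dd (iflag=fullblock) and GNU stat (-c %s); process_in_chunks is a hypothetical helper name, and the block count is computed up front so the loop terminates instead of running forever.

```shell
# Sketch: feed FILE to COMMAND in chunks of SIZE bytes, reading through a pipe.
# Usage: process_in_chunks FILE SIZE COMMAND [ARGS...]   (hypothetical helper)
# The body runs in a subshell, so its variables stay local.
process_in_chunks() (
    file=$1
    size=$2
    shift 2
    total=$(stat -c %s "$file") || exit 1      # file size in bytes (GNU stat)
    blocks=$(( (total + size - 1) / size ))    # round up to include the last chunk
    i=0
    cat -- "$file" | while [ "$i" -lt "$blocks" ]; do
        # fullblock makes dd keep reading until SIZE bytes have arrived or EOF is hit
        dd bs="$size" count=1 iflag=fullblock 2>/dev/null | "$@"
        i=$((i + 1))
    done
)
```

For example, process_in_chunks some_big_file 1048576 processor runs processor once per 1 MiB chunk, with a final shorter chunk if the file size is not a multiple of 1 MiB.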
