How to split a multi-gigabyte file into chunks of about 1.5 gigabytes using Linux split?

Question

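I have a file that can be larger than 4 GB. I am using the Linux split command to split it by lines (this is a requirement). After splitting the original file, I want every split file to be smaller than 2 GB. The original file size may vary between 3 and 5 GB. I want to write some logic in my shell script that works out a line count and feeds it into my split command below, so that the split files stay under 2 GB.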

split -l 100000 -d abc.txt abc

Answer 1

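It is always advisable to consult the manual before posting a question. The split command provides an option to split a file by bytes. Here is the option as it appears in the split man page.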

   -b, --bytes=SIZE
          put SIZE bytes per output file

split --bytes=1500000000 abc.txt abc

You do not need to specify the number of lines explicitly. This command serves your purpose.
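Since the question requires splitting by lines, it may help to see how the byte-based option treats line boundaries. The following is a minimal sketch, assuming GNU split; the split_demo directory, the file names, and the 15-byte size are made up for illustration. It compares -b with the line-aware -C option on a tiny sample.

```shell
#!/bin/sh
# Sketch: compare byte-based and line-aware splitting on a tiny sample.
# Assumes GNU split; directory, file names, and sizes are illustrative only.
mkdir -p split_demo

# Three 10-byte lines (9 characters plus a newline each).
printf 'aaaaaaaaa\nbbbbbbbbb\nccccccccc\n' > split_demo/sample.txt

# -b cuts every 15 bytes, regardless of where the newlines fall.
split -b 15 split_demo/sample.txt split_demo/bytes_

# -C packs whole lines into each chunk, up to 15 bytes per chunk.
split -C 15 split_demo/sample.txt split_demo/lines_

# Show the resulting chunk sizes.
wc -c split_demo/bytes_* split_demo/lines_*
```

With -b the second line ends up spread over two chunks, whereas -C keeps each line whole. If lines must stay intact, -C or -l is likely the safer choice.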

Answer 2

Converting comments into an answer.

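Seeking clarification: how many lines are in a typical file? How much does the line length vary? Can you do some arithmetic, including a margin of error, to work out how many lines to request? Have you looked at the options of your split command? Does it support the -C option? (GNU split says: "-C, --line-bytes=SIZE put at most SIZE bytes of lines per output file", which sounds like it could be what you want.)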

This is what I thought of doing.

1. Run wc -l abc.txt; this gives me the total number of lines in the file.

2. Get the size of the original file abc.txt and divide it by the number of lines in the file; that gives me the size per line.

3. Divide 1.5 GB (or any number less than 2 GB) by the size per line; that gives me the number of lines.

4. Use the number of lines from step 3 in my split command.

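This is why I asked about the file and line sizes. If your file has many lines that are 10 bytes long and some that are 20 KiB long, you could run into trouble: a run of 20 KiB lines that happen to be grouped together could produce a chunk that exceeds your limit. However, chances are your data is uniform enough that you will not hit this problem.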
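One way to judge how uniform the lines are is to compare the average line length with the longest line. The sketch below is only an illustration: the linelen_demo directory, the sample data, and the lines_per_chunk value are assumptions, not part of the original discussion. It uses awk to compute both figures and derives an average-based estimate and a worst-case bound for a fixed line count.

```shell
#!/bin/sh
# Sketch: estimate average and worst-case chunk sizes for a fixed line count.
# The sample file and lines_per_chunk are made up for illustration.
mkdir -p linelen_demo
printf 'short\nshort\nshort\n%s\n' \
    "this line is much longer than the others" > linelen_demo/sample.txt

lines_per_chunk=2

# length($0) excludes the newline, so add 1 byte per line.
# LC_ALL=C makes awk count bytes rather than multibyte characters.
stats=$(LC_ALL=C awk '{ n = length($0) + 1; total += n; if (n > max) max = n }
    END { printf "%d %d", total / NR, max }' linelen_demo/sample.txt)
avg=${stats% *}    # first field: average bytes per line, rounded down
max=${stats#* }    # second field: longest line in bytes

echo "average bytes per line: $avg"
echo "longest line in bytes:  $max"
echo "average-based estimate: $(( avg * lines_per_chunk )) bytes per chunk"
echo "worst-case bound:       $(( max * lines_per_chunk )) bytes per chunk"
```

If the worst-case bound is far above the average-based estimate, a line count derived from the average alone leaves little safety margin.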

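Consider whether it is worth installing GNU split on your machine (not as a replacement for the standard-issue split; install it in a separate directory such as /usr/gnu/bin).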

The number of lines varies from file to file, but one of the files I am working on has 328969322 lines, and the file size is 52.5GB. Yes, I checked the options of my split and it does support -C option. How do I use that in my problem?

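I note that this data file is much bigger (by about a factor of ten) than the sizes mentioned in the question. However, that is not a major problem. Basic usage would be: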

split -C 1500000000 datafile

Or, if you want 1.5 GiB instead of 1.5 GB, use:

split -C 1610612736 datafile

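When I experimented with split -C 20 on data where some lines were 40 bytes long, the long lines were split (into pieces of at most 20 bytes), while the shorter lines were grouped to make files of up to 20 bytes. Check the behavior on a small data file (with a small chunk size) first.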

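From the figures you provide, your lines average about 170 bytes each, so you should not run into any problems with inappropriate splitting. If you want, you could try: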

sed 100q datafile | split -C 1700 -

That should give you about 10 files, each with about 10 lines.
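A small experiment like this can be extended to check the result automatically. The following is a sketch under stated assumptions: GNU split and seq are available, and the verify_demo directory, the chunk_ prefix, and the generated stand-in data are illustrative. It confirms that no chunk exceeds the limit and that every chunk ends on a line boundary.

```shell
#!/bin/sh
# Sketch: check that every chunk respects the limit and ends with a newline.
# Assumes GNU split and seq; names and the 1700-byte limit are illustrative.
mkdir -p verify_demo
seq 1 1000 > verify_demo/datafile    # 1000 short lines as stand-in data
limit=1700
split -C "$limit" verify_demo/datafile verify_demo/chunk_

bad=0
for f in verify_demo/chunk_*; do
    size=$(wc -c < "$f")
    # Command substitution strips a trailing newline, so an empty result
    # from tail -c 1 means the last byte of the chunk is a newline.
    last=$(tail -c 1 "$f")
    if [ "$size" -gt "$limit" ] || [ -n "$last" ]; then
        echo "problem with $f: $size bytes"
        bad=$((bad + 1))
    fi
done
echo "chunks with problems: $bad"
```

The same loop could be pointed at the real chunks, with the limit raised to 1500000000, once the small test behaves as expected.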

Answer 3

This is how I solved the problem. Sorry for posting the solution so late.

1. Declared a global variable DEFAULT_SPLITFILE_SIZE = 1.5 GB.

DEFAULT_SPLITFILE_SIZE=1500000000

2. Counted the number of lines in the file.

LINES_IN_FILE=`wc -l "$file" | awk '{print $1}'`

echo `date`  "Total line count = ${LINES_IN_FILE}."

3. Worked out the size of the file.

FILE_SIZE=`stat -c %s "${file}"`

4. Calculated the average size of each line in the file.

SIZE_PER_LINE=$(( FILE_SIZE / LINES_IN_FILE ))

echo `date`  "Bytes Per Line = $SIZE_PER_LINE"

5. Calculated the number of lines needed to make a 1.5 GB split file.

SPLIT_LINE=$(( DEFAULT_SPLITFILE_SIZE / SIZE_PER_LINE ))

echo `date`  "Lines for Split = $SPLIT_LINE"
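The steps above stop before the split call itself. A minimal sketch of how they might fit together follows. It is not the answer's exact script: the steps_demo directory, the generated sample data, the tiny 1500-byte limit, and the part_ prefix are assumptions chosen so the example runs quickly, and the average line size is rounded up as an extra precaution.

```shell
#!/bin/sh
# Sketch: the five steps above, followed by the split call.
# Sample file, tiny limit, and "part_" prefix are illustrative assumptions;
# a real run would use DEFAULT_SPLITFILE_SIZE=1500000000. -d needs GNU split.
mkdir -p steps_demo
file=steps_demo/abc.txt
seq 1 1000 > "$file"                  # stand-in data: 1000 short lines

DEFAULT_SPLITFILE_SIZE=1500
LINES_IN_FILE=$(wc -l < "$file")
FILE_SIZE=$(wc -c < "$file")          # portable alternative to stat -c %s

# Integer division rounds down; adding 1 rounds the average up so the
# resulting line count errs on the small side.
SIZE_PER_LINE=$(( FILE_SIZE / LINES_IN_FILE + 1 ))
SPLIT_LINE=$(( DEFAULT_SPLITFILE_SIZE / SIZE_PER_LINE ))

echo "Lines for Split = $SPLIT_LINE"
split -l "$SPLIT_LINE" -d "$file" steps_demo/part_
```

Because the line count is based on an average, a chunk can still exceed the limit if long lines cluster together, so checking the output sizes afterwards is worthwhile.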

linux

shell

logic

split