To Split into fixed sequences smoothly by group values of flags

Question

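I want to split as evenly as possible by varying the number of files, within some minimum and maximum number of files, so that the output file sizes differ by no more than one byte. In my first thread, a discussion about the sequence-loading behaviour, I provided too few cases to explain how the sequences behave, but the increment caused the last sequence to grow by 5 characters. Flags with different conditions can be used.

Data structures

This smoothing cannot be done with a well-defined algorithm alone. I only have an intuition that a partial index could work, because there is always only a small subset that has data, and the smoothing happens dynamically through the entries of the directory. The solution probably involves some carefully chosen data structures and some algorithm.

Pseudocode of the algorithm

I want to influence how characters are loaded into the result files, which currently happens in a rather illogical and unsmooth way.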

$ seq -w 0 0.0001 1                                \
| gsed 's/\.//g'                                   \
| gsed ':a;N;$!ba;s/\n//g' > /tmp/k                \
&& gsplit -n{a,b} -e -b{k,n,m} /tmp/k              \
&& wc -c 1stFile && wc -c lastFile

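where

the part of the command gsplit -n{a,b} -b{k,n,m} is only pseudocode
the flags -n and -b can be used
-e removes empty files from the output, but on its own it is not enough to force the output into some interval
the flag -n with a single value gives a fixed number of files, which -e can reduce toward the minimum, so the approach here is probably to handle each unit of the group separately
the determinism of the flag -n, when it has only one value, leads to the sequences being loaded into the files in a strange way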
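
To check a given result against the one-byte goal, something like the following could report how uneven the output files are. It is only a sketch: size_spread is a made-up helper name, and it assumes the pieces can be listed with a glob such as x* (x being the default output prefix of split).

```shell
#!/bin/sh
# size_spread FILE...
# Hypothetical helper: print the smallest size, the largest size and their
# difference (in bytes) for the given files.
size_spread() {
    for f in "$@"; do
        wc -c < "$f"          # one byte count per line
    done | sort -n | awk '
        NR == 1 { min = $1 }  # first line after sorting is the smallest
                { max = $1 }  # last line seen is the largest
        END     { print min, max, max - min }'
}
```

After a split it could be called as size_spread x*; a third number greater than 1 means the sizes are not within one byte of each other.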

How can I better control the loading of new sequences into new files without getting sudden peaks in some of the files?

Answer 1

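Here is a shell script that works out the allowable combinations of file size and number of files for the given parameters. It exits with success if any combination is found, and with failure if no possible combination exists for the given input. Note that not every possible combination of parameters has a solution. If a solution is required, the allowed number of files can be increased or decreased. The two trivial cases, a single file or as many files as there are bytes, are always solvable.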

#!/bin/sh

# N is the bytes total.
# L is the lowest number of files allowable.
# H is the highest number of files allowable.
# F is the actual number of files used
# B is the minimum bytes per file
# R is the remaining bytes if all files are of size B
# K is the maximum number of files allowed to be one byte larger than the
# minimum, K < F
# 
# So, you need to determine if there is some L <= F <= H such that R <= K.
# 
# For a given candidate F:
# B = floor(N / F)
# R = N % F
# if R <= K then the candidate F is allowable, F files will be used,
# R of them will be of size B+1 and F-R of them will be of size B.

# usage: <program> <bytes> <min files> <max files> [max larger files]
# copyright disclaimed, this program is in the public domain

N=$1
L=$2
H=$3
K=${4:-1} # default to one file allowed to be larger

status=1
echo checking number of files F: $L '<= F <=' $H, at most $K files one byte larger
for F in $(seq $L $H); do
        B=$(($N / $F))
        R=$(($N % $F)) # bytes left over after giving each of the F files B bytes
        if [ $R -le $K ]; then
                if [ $R -eq 0 ]; then
                        echo Usable: $F files, size $B
                else
                        echo Usable: $F files, $(($F - $R)) size $B, $R size $(($B+1))
                fi
                status=0
        fi
done
exit $status
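
To see the per-candidate arithmetic in isolation, here is a minimal sketch for a single F. The function name describe_split and the numbers 10 and 4 are only illustrative and do not come from the script above.

```shell
#!/bin/sh
# describe_split N F
# Hypothetical helper: show how N bytes fall into F files when every file
# gets floor(N / F) bytes and the leftover bytes go one each to R files.
describe_split() {
    N=$1
    F=$2
    B=$((N / F))   # minimum bytes per file
    R=$((N % F))   # number of files that end up one byte larger
    echo "$F files: $((F - R)) of size $B, $R of size $((B + 1))"
}

describe_split 10 4   # prints: 4 files: 2 of size 2, 2 of size 3
```

With 10 bytes and 4 files, B is 2 and R is 2, so this candidate is only usable if at least two files are allowed to be one byte larger.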

Some examples:

A largish prime number of bytes:

% sh trysplit 16769023 3 100; echo $?
checking number of files F: 3 <= F <= 100, at most 1 files one byte larger
Usable: 3 files, 2 size 5589674, 1 size 5589675
Usable: 6 files, 5 size 2794837, 1 size 2794838
Usable: 61 files, 60 size 274902, 1 size 274903
0
%

Well, it has some solutions, but ugh.

How about a lucky number:

% sh trysplit 16769024 3 100; echo $?
checking number of files F: 3 <= F <= 100, at most 1 files one byte larger
Usable: 4 files, size 4192256
Usable: 8 files, size 2096128
Usable: 16 files, size 1048064
Usable: 23 files, size 729088
Usable: 32 files, size 524032
Usable: 46 files, size 364544
Usable: 64 files, size 262016
Usable: 89 files, size 188416
Usable: 92 files, size 182272
0
%

One byte larger, and you have plenty of choices.

What if we allow more than one file to be larger:

% sh trysplit 16769023 3 100 2; echo $?
checking number of files F: 3 <= F <= 100, at most 2 files one byte larger
Usable: 3 files, 2 size 5589674, 1 size 5589675
Usable: 6 files, 5 size 2794837, 1 size 2794838
Usable: 17 files, 15 size 986413, 2 size 986414
Usable: 61 files, 60 size 274902, 1 size 274903
0
%

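What if any of them may be larger? I think that in that case, though I have not proven it, you can use any number of files you want, and it only affects how many of them are allotted one extra byte. You can use the script to check whether the exact number of files you want works by setting the minimum and maximum number of files to the same value and the allowed number of larger files to one less than that.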
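
The remainder of a division by F is always smaller than F, which is why this should hold for every file count. A quick brute-force check along those lines, as a sketch only; check_any_count is a hypothetical name and 1000 is an arbitrary test size:

```shell
#!/bin/sh
# check_any_count N
# Hypothetical helper: for every file count F from 1 to N, confirm that the
# number of one-byte-larger files (N % F) never exceeds F - 1.
check_any_count() {
    N=$1
    F=1
    while [ "$F" -le "$N" ]; do
        R=$((N % F))
        if [ "$R" -gt $((F - 1)) ]; then
            echo "F=$F fails"
            return 1
        fi
        F=$((F + 1))
    done
    echo "every F from 1 to $N works"
}

check_any_count 1000
```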

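This could be modified to print only the parameters you are interested in, so that you can use it to populate a shell variable, which can then be used to construct the appropriate split command.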
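
One way such a split could be put together from a chosen F is sketched below. It is an assumption-laden illustration, not part of the script above: split_even and the a./b. prefix scheme are made up, it relies on head -c (widely available, though not required by POSIX), and with the default two-letter suffixes split can only produce 676 files per call.

```shell
#!/bin/sh
# split_even FILE F PREFIX
# Hypothetical helper: cut FILE into F pieces whose sizes differ by at most
# one byte. The R larger pieces are written first, then the F - R smaller ones.
split_even() {
    file=$1
    F=$2
    prefix=$3
    N=$(wc -c < "$file")
    N=$((N + 0))                  # drop padding that some wc versions print
    B=$((N / F))                  # minimum bytes per piece
    R=$((N % F))                  # pieces that get B + 1 bytes
    head_bytes=$((R * (B + 1)))   # bytes consumed by the larger pieces
    if [ "$R" -gt 0 ]; then
        head -c "$head_bytes" "$file" | split -b "$((B + 1))" - "${prefix}a."
    fi
    if [ "$B" -gt 0 ]; then
        tail -c "+$((head_bytes + 1))" "$file" | split -b "$B" - "${prefix}b."
    fi
}
```

For example, split_even /tmp/k 7 part. would write the larger pieces as part.a.aa, part.a.ab and so on, followed by the smaller ones as part.b.aa onward, so a plain glob lists them in their original order.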

unix

algorithm

shell

split

data-structures
