Efficient top 10 sort of multiple columns

Question

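I'm looking for an efficient way to get the top 3 (the count should be adjustable) of multiple numeric fields from a large dataset on a Linux server. This is a follow-up to an earlier question, where the best suggestion for

"I have an awk array that aggregates bytes up and downloaded. I can sort the output by either bytes down or up and pipe that to head for the top talkers; is it possible to output two sorts using different keys?"

was: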

zgrep '^1' 20211014T00*.gz|
awk '
    NR > 1 {
         key = $1 " " $2
         bytesDown[key] += $3
         bytesUp[key] += $4
     }
     END {
         cmd = "sort -rn | head -3"
         for ( key in bytesDown ) {
             print bytesDown[key], bytesUp[key], key | cmd
         }
         close(cmd)
 
         cmd = "sort -rnk2 | head -3"
         for ( key in bytesDown ) {
             print bytesDown[key], bytesUp[key], key | cmd
         }
         close(cmd)
     }
'

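However, since the dataset can range from 1000 lines to millions of lines, rather than reading the whole set into an array, sorting it, and discarding the vast majority, would it be feasible to maintain arrays of the top 10 as the data is read in? Absolute speed is less of an issue than memory consumption, which is a relatively limited resource on the server.

For example, given the following sample input: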

ip1     fqdn101 101     10
ip2     fqdn102 102     11
ip3     fqdn103 103     12
ip4     fqdn104 104     13
ip1     fqdn101 105     14
ip1     fqdn102 106     15
ip1     fqdn103 107     16
ip1     fqdn104 108     17
ip2     fqdn103 109     16
ip2     fqdn104 110     17

that should output

ip1 fqdn101 206 24
ip2 fqdn104 110 17
ip2 fqdn103 109 16

and

ip1 fqdn101 206 24
ip2 fqdn104 110 17
ip1 fqdn104 108 17

I'm open to options other than awk (though that would be my default starting point), as long as they are available on the corporate Linux server build I get...

Answer 1

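Since you said "Absolute speed is less of an issue than memory consumption", here's how to do what you want reasonably quickly while using minimal memory: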

$ cat tst.sh
#!/usr/bin/env bash

tmp=$(mktemp) || exit 1
trap 'rm -f "$tmp"; exit' 0

sort -k1,1 -k2,2 "${@:--}" |
awk '
    { key = $1 " " $2 }
    key != prev {
        if ( NR>1 ) {
            print prev, tot3, tot4
        }
        tot3 = tot4 = 0
        prev = key
    }
    {
        tot3 += $3
        tot4 += $4
    }
    END {
        print key, tot3, tot4
    }
' > "$tmp"

sort -nrk3,3 "$tmp" | head -3
printf '\n'
sort -nrk4,4 "$tmp" | head -3

$ ./tst.sh file
ip1 fqdn101 206 24
ip2 fqdn104 110 17
ip2 fqdn103 109 16

ip1 fqdn101 206 24
ip2 fqdn104 110 17
ip1 fqdn104 108 17

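The only tool above that has to handle the whole input at once is sort, and it is designed to use demand paging, etc. to handle large files, so it doesn't need to hold the whole input in memory to work. The temp file used above will be much smaller than your original input file, since it only stores the totals for each key pair.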
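If memory is the tight resource, it may also be worth capping sort's working memory explicitly. The sketch below assumes GNU coreutils sort, where -S (--buffer-size) bounds the in-memory buffer and -T (--temporary-directory) chooses where spill files are written. The 64M figure and the sort_by_key name are illustrative assumptions, not something the script above uses.

```shell
#!/usr/bin/env bash

# Hypothetical helper (name and buffer size are assumptions): sort by the
# two key fields while limiting how much memory sort may use.
#   -S 64M  caps the main-memory buffer (GNU sort)
#   -T DIR  directory for the temporary files sort writes once the input
#           no longer fits in that buffer
sort_by_key() {
    sort -S 64M -T "${TMPDIR:-/tmp}" -k1,1 -k2,2 "${@:--}"
}
```

In tst.sh this would stand in for the first sort call, for example `sort_by_key "$@" | awk '...'`. A suitable buffer size depends on what the server can spare.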

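If you don't want to use a temp file then you could do this instead, which will take longer to run (maybe 1.5 times as long?) since it sorts and aggregates the input once per total: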

$ cat tst.sh
#!/usr/bin/env bash

do1tot() {
    local totPos="$1"
    shift

    sort -k1,1 -k2,2 "${@:--}" |
    awk '
        { key = $1 " " $2 }
        key != prev {
            if ( NR>1 ) {
                print prev, tot3, tot4
            }
            tot3 = tot4 = 0
            prev = key
        }
        {
            tot3 += $3
            tot4 += $4
        }
        END {
            print key, tot3, tot4
        }
    ' |
    sort -nrk"$totPos,$totPos" |
    head -3
}

do1tot 3 "${@:--}"
printf "\n"
do1tot 4 "${@:--}"

$ ./tst.sh file
ip1 fqdn101 206 24
ip2 fqdn104 110 17
ip2 fqdn103 109 16

ip1 fqdn101 206 24
ip2 fqdn104 110 17
ip1 fqdn104 108 17
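
You asked whether a small top-N array could be maintained as the data is read. One possible sketch of that idea, assuming the input is first sorted by key as in the scripts above (so each key's total is complete before the next key starts), keeps two bounded lists inside awk and needs neither a temp file nor a second sort. The names topn, record_total and insert are hypothetical. On equal totals the later key is placed first, which matches the sample output here but may not match sort's tie order in every locale.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the top N keys by column-3 total, a blank line,
# then the top N keys by column-4 total. awk holds at most 2*N rows.
# Usage: topn N [file...]
topn() {
    local n="$1"
    shift

    sort -k1,1 -k2,2 "${@:--}" |
    awk -v n="$n" '
        # Insert (val, line) into the descending list for column c,
        # keeping at most n entries; on ties the newer entry goes first.
        function insert(c, val, line,    i) {
            if ( cnt[c] == n && val < topVal[c, n] ) {
                return                      # smaller than everything kept
            }
            if ( cnt[c] < n ) {
                cnt[c]++                    # list not full yet: grow by one
            }
            # Shift entries that are <= val one slot towards the end.
            for ( i = cnt[c]; i > 1 && val >= topVal[c, i-1]; i-- ) {
                topVal[c, i]  = topVal[c, i-1]
                topLine[c, i] = topLine[c, i-1]
            }
            topVal[c, i]  = val
            topLine[c, i] = line
        }

        # Offer the finished totals of the previous key to both lists.
        function record_total(    line) {
            line = prev " " tot3 " " tot4
            insert(3, tot3, line)
            insert(4, tot4, line)
        }

        { key = $1 " " $2 }
        key != prev {
            if ( NR > 1 ) {
                record_total()
            }
            tot3 = tot4 = 0
            prev = key
        }
        {
            tot3 += $3
            tot4 += $4
        }
        END {
            if ( NR > 0 ) {
                record_total()
            }
            for ( i = 1; i <= cnt[3]; i++ ) print topLine[3, i]
            print ""
            for ( i = 1; i <= cnt[4]; i++ ) print topLine[4, i]
        }
    '
}
```

Called as `topn 3 file`, this should print the same two blocks as tst.sh for the sample input. The insertion cost is proportional to N per key, which is fine for a top 3 or top 10 but would want a different structure for large N.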
