Sort contents of awk associative array element

Initially, the file contents are as follows:

1.2.3.4: 1,3,4
1.2.3.5: 9,8,7,6
1.2.3.4: 4,5,6
1.2.3.6: 1,1,1

After my (incorrect) attempt at sorting, I get this:

1.2.3.4: 1,3,4,4,5,6,
1.2.3.5: 9,8,7,6,
1.2.3.6: 1,1,1,

I would like to get it into the following format:

1.2.3.4: 1,3,4,5,6
1.2.3.5: 6,7,8,9
1.2.3.6: 1

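But how do I access each comma-separated value within an element and sort the values in ascending order, removing duplicates? So far, the only script I have managed to write can only access the element as a whole: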

awk -F' ' 'NF>1{a[$1] = a[$1]$2","}END{for(i in a){print i" "a[i] | "sort -t: -k1 "}}' c.txt

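Edit: I first treated the intermediate data as the input, because the original data had not been posted at the time, but of course the result can also be obtained from the original data. Again with GNU awk (4.0 or later, since the code relies on arrays of arrays):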

gawk -F '[ ,]' 'BEGIN { PROCINFO["sorted_in"] = "@ind_num_asc" } { for(i = 2; i <= NF; ++i) a[$1][$i]; } END { for(ip in a) { line = ip " "; for(n in a[ip]) { line = line n "," } sub(/,$/, "", line); print line } }' filename

The code works as follows:

BEGIN { 
  PROCINFO["sorted_in"] = "@ind_num_asc"  # GNU-specific: sorted array
                                          # traversal
}
{
  for(i = 2; i <= NF; ++i) a[$1][$i]      # remember numbers by ip
}
END {                                     # in the end:
  for(ip in a) {                          # for all ips:
    line = ip " "                         # construct the line: IP
    for(n in a[ip]) {                     # numbers in order
      line = line n ","
    }
    sub(/,$/, "", line)                   # remove trailing comma
    print line                            # print the result.
  }
}
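
For awk implementations without arrays of arrays or PROCINFO["sorted_in"], the same idea can be sketched with a pipeline. This is only an illustration, assuming POSIX awk and sort are available; sort_ip_values is a hypothetical helper name, not part of the answer above. It prints one "ip value" pair per line, lets sort -u do the numeric ordering and duplicate removal, and then joins the pairs back into one line per IP.

```shell
#!/bin/sh
# Sketch only: sort_ip_values is a hypothetical helper, not from the original answer.
# Assumes POSIX awk and sort; no gawk-specific features are used.
sort_ip_values() {
  # 1) split on spaces and commas, emit one "ip value" pair per line
  #    (empty fields, e.g. from a trailing comma, are skipped)
  awk -F '[ ,]' '{ for (i = 2; i <= NF; ++i) if ($i != "") print $1, $i }' "$1" |
    # 2) order by ip, then numerically by value; -u drops repeated pairs
    LC_ALL=C sort -u -k1,1 -k2,2n |
    # 3) join the values of each ip with commas, keeping the sorted order
    awk '!($1 in vals) { order[++n] = $1; vals[$1] = $2; next }
         { vals[$1] = vals[$1] "," $2 }
         END { for (k = 1; k <= n; ++k) print order[k], vals[order[k]] }'
}
```

Called as sort_ip_values c.txt on the original data from the question, this should print the three lines in the desired format. Because empty fields are skipped, it should also accept the intermediate data with trailing commas.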

Old answer for the intermediate data:

With GNU awk, assuming the data is formatted exactly as in the question (with a trailing comma):

gawk -F '[ ,]' 'BEGIN { PROCINFO["sorted_in"] = "@ind_num_asc" } { delete a; for(i = 2; i < NF; ++i) a[$i]; line = $1 " "; for(i in a) line = line i ","; sub(/,$/, "", line); print line; }' filename

With the fields split on spaces and commas, the code works as follows:

BEGIN { 
  PROCINFO["sorted_in"] = "@ind_num_asc"  # GNU-specific: sorted array
                                          # traversal, numerically ascending
}
{
  delete a
  for(i = 2; i < NF; ++i) { a[$i] }       # remember the fields in a line.
                                          # duplicates are removed here.
                                          # note that it's < NF instead of
                                          # <= NF because the trailing comma
                                          # leaves us with an empty last
                                          # field.

  line = $1 " "                            # start building line: IP field
  for(i in a) {                           # append numbers separated by
    line = line i ","                     # commas
  }
  sub(/,$/, "", line)                     # remove last trailing comma
  print line                              # print result.
}
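
If gawk is not available, the per-line version can also be sketched in plain POSIX awk. The following is an illustration under that assumption, not the method of the answer above: dedupe_sort_lines is a hypothetical helper name, and a small insertion sort stands in for the sorted traversal that PROCINFO["sorted_in"] provides.

```shell
#!/bin/sh
# Sketch only: dedupe_sort_lines is a hypothetical helper, not from the original answer.
# Uses plain POSIX awk; an insertion sort replaces gawk's sorted array traversal.
dedupe_sort_lines() {
  awk '{
    n = split($2, parts, ",")        # "1,3,4,4,5,6," -> parts[1..n]
    m = 0                            # number of distinct values kept so far
    split("", seen)                  # portable way to empty an array
    for (i = 1; i <= n; ++i) {
      v = parts[i]
      if (v == "" || (v in seen)) continue   # skip the empty last piece and repeats
      seen[v]
      # insertion sort: shift numerically larger values right, then place v
      for (j = m; j >= 1 && sorted[j] + 0 > v + 0; --j) sorted[j + 1] = sorted[j]
      sorted[j + 1] = v
      ++m
    }
    line = $1 " "                    # start with the IP field
    for (i = 1; i <= m; ++i) line = line sorted[i] (i < m ? "," : "")
    print line
  }' "$1"
}
```

Called as dedupe_sort_lines filename on the intermediate data, this should produce the same three lines as the gawk one-liner. Insertion sort is quadratic in the number of values per line, which is fine for short lists like these but would be a poor choice for very long ones.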