Alphanumeric or "version sort" in awk array traversal

The bioinformatics files I work with contain lines that combine letters and numbers, like this:

Edit: updated example

chr1    1    100
chr10   1    500
chr2    33   52
chr5    11   66
chr22   99   1052
chr11   444  2141
chr2    555  1200
chr7    300   444
chr7    44   222
chr21   24   6023
chr16   224  5521
chr3    3    200
chrX    6234    79593
chr1    5291    5500
chrY    204  310

I would like them sorted like this:

chr1    1    100
chr1    5291    5500
chr2    33   52
chr2    555  1200
chr3    3    200
chr5    11   66
chr7    44   222
chr7    300   444
chr10   1    500
chr11   444  2141
chr16   224  5521
chr21   24   6023
chr22   99   1052
chrX    6234    79593
chrY    204  310

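I am sorting them with controlled array traversal, since I scan them in a for loop, but the built-in orderings only sort by number, by string, or by type. I want something that sorts exactly like GNU sort -V. From the link below I see that I can write a custom comparison function for sorting, but I'm not sure how to do that for alphanumeric values.

I'd like to do something like the following: handle the alphabetic and numeric parts separately and then find a way to combine them, but I'm not sure exactly how.

echo "d23FgE55" | awk '{alph=$1;num=$1;gsub("[0-9]+","-",alph);gsub("[A-Za-z]+","-",num);print alph,num}'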
d-FgE- -23-55

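Edit

Sorry, one of the answers reminded me of some more complicated examples that I sometimes run into, which pushes my question closer to asking for the behavior of sort -V.

Unsorted example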

chr1_KI270708v1_random  9240458 9393655
chrUn_KI270589v1        5789405 6182867
chr19_GL383576v1_alt    4363702 4753141
chr8_KI270820v1_alt     1008865 1426444
chrUn_GL000220v1        7612088 7825236
chrUn_KI270591v1        9457975 9812609
chr19_GL949747v2_alt    1578276 2033118
chr13_KI270841v1_alt    8680841 9033557
chr17_KI270859v1_alt    3996864 4344945
chr20_GL383577v2_alt    3002112 3480396
chrUn_KI270322v1        1563181 1629375
chrUn_KI270419v1        5482364 5893900
chrUn_KI270310v1        4229845 4626802
chr17_KI270907v1_alt    1735306 2201566
chr9    2052238 2476827
chr1_KI270713v1_random  7260088 7358876
chr19_KI270891v1_alt    7485890 7719006
chr19_KI270917v1_alt    7816269 7864474
chrUn_KI270378v1        8848225 9158581
chr19_KI270933v1_alt    1600444 2095219
chr3_KI270937v1_alt     9902343 10113942
chr12_KI270904v1_alt    6924313 7067502
chrUn_GL000214v1        2468418 2728693
chr19_KI270884v1_alt    2760167 3027068
chrUn_KI270582v1        6390491 6398266
chr5_KI270897v1_alt     324409  548890
chr1_GL383520v2_alt     3718510 3948906
chr19_KI270888v1_alt    3809944 3824324
chr11_JH159137v1_alt    6090599 6480896
chr1_KI270762v1_alt     4196600 4684821
chr11_KI270831v1_alt    2112401 2490081
chr5_GL339449v2_alt     7220557 7718111
chr19_KI270889v1_alt    8709969 8823256
chr19_KI270931v1_alt    6356216 6811953
chr5_GL383530v1_alt     109389  128439
chr11_KI270927v1_alt    2119470 2348459
chr17_KI270730v1_random 404268  854145
chrUn_KI270387v1        7430161 7648856
chr10   2656835 2873499
chr19_GL383573v1_alt    7863497 7896279
chrUn_KI270741v1        1292450 1371680
chrUn_KI270335v1        7360266 7748998
chr9_GL383539v1_alt     3394859 3499461
chrX    4490524 4892623
chrUn_KI270311v1        963681  1069745
chr11_JH159136v1_alt    8171978 8319851
chr17_JH159148v1_alt    3708868 3984631
chrUn_KI270544v1        7025954 7392905
chr19_KI270888v1_alt    9683785 10166473
chrUn_KI270521v1        6924036 7001552
chr1_KI270710v1_random  8843336 9304602
chr19_GL949746v1_alt    8572018 8832793
chrUn_KI270322v1        9392841 9512920
chrUn_KI270366v1        3332191 3576201
chr13_KI270841v1_alt    9033828 9276044
chr19_GL949748v2_alt    3321575 3743545
chr16   3704062 4122526
chr17_GL383563v3_alt    7476487 7845527
chr8_KI270810v1_alt     8499569 8873953
chr22_KI270732v1_random 9187626 9418154
chr20_GL383577v2_alt    3031797 3421627
chr9_GL383541v1_alt     3771485 3927905
chr9_KI270720v1_random  8948742 9157627
chr16_KI270854v1_alt    7278604 7297845
chr1_KI270763v1_alt     1518275 1527443
chrUn_KI270517v1        3377374 3454859
chr15_KI270850v1_alt    6358541 6822565
chr22_KB663609v1_alt    4944645 4971886
chr7_KI270806v1_alt     9032201 9471652
chrUn_KI270438v1        8523562 8944980
chr17_KI270730v1_random 3544067 3796807
chr18   1856815 2144546
chr20_KI270869v1_alt    2269342 2353172
chr5_GL383530v1_alt     2139701 2285854
chr8_KI270819v1_alt     7048265 7503415
chr17_JH159148v1_alt    7040113 7042904
chr5_KI270793v1_alt     9363008 9816819
chr19_KI270931v1_alt    1794178 2143519
chr3_KI270778v1_alt     2228100 2549359
chr19_KI270932v1_alt    8320855 8486835
chr12_GL877876v1_alt    3736839 3820171
chr12_GL877876v1_alt    2805577 2974710
chr4_KI270789v1_alt     3309756 3669565
chr19_KI270917v1_alt    9443280 9678387
chr11_KI270831v1_alt    8603751 9028904
chrUn_KI270387v1        5189812 5439563
chrUn_KI270507v1        599827  666674
chr1_KI270711v1_random  6111532 6446935
chr2_KI270773v1_alt     8604675 8922311
chr16_KI270856v1_alt    5855373 6089898
chr3    3097856 3436204
chr13_KI270840v1_alt    3127654 3295739
chr15_KI270849v1_alt    8948327 9336376
chr18_KI270863v1_alt    3984764 4166006
chr19_KI270920v1_alt    5554949 5919294
chr12   6624798 7106478
chr4_GL383528v1_alt     2099770 2280753
chr2_KI270769v1_alt     8706329 9147304
chr8_KI270812v1_alt     6146462 6388606

Sorted output

chr1_GL383520v2_alt     3718510 3948906
chr1_KI270708v1_random  9240458 9393655
chr1_KI270710v1_random  8843336 9304602
chr1_KI270711v1_random  6111532 6446935
chr1_KI270713v1_random  7260088 7358876
chr1_KI270762v1_alt     4196600 4684821
chr1_KI270763v1_alt     1518275 1527443
chr2_KI270769v1_alt     8706329 9147304
chr2_KI270773v1_alt     8604675 8922311
chr3    3097856 3436204
chr3_KI270778v1_alt     2228100 2549359
chr3_KI270937v1_alt     9902343 10113942
chr4_GL383528v1_alt     2099770 2280753
chr4_KI270789v1_alt     3309756 3669565
chr5_GL339449v2_alt     7220557 7718111
chr5_GL383530v1_alt     109389  128439
chr5_GL383530v1_alt     2139701 2285854
chr5_KI270793v1_alt     9363008 9816819
chr5_KI270897v1_alt     324409  548890
chr7_KI270806v1_alt     9032201 9471652
chr8_KI270810v1_alt     8499569 8873953
chr8_KI270812v1_alt     6146462 6388606
chr8_KI270819v1_alt     7048265 7503415
chr8_KI270820v1_alt     1008865 1426444
chr9    2052238 2476827
chr9_GL383539v1_alt     3394859 3499461
chr9_GL383541v1_alt     3771485 3927905
chr9_KI270720v1_random  8948742 9157627
chr10   2656835 2873499
chr11_JH159136v1_alt    8171978 8319851
chr11_JH159137v1_alt    6090599 6480896
chr11_KI270831v1_alt    2112401 2490081
chr11_KI270831v1_alt    8603751 9028904
chr11_KI270927v1_alt    2119470 2348459
chr12   6624798 7106478
chr12_GL877876v1_alt    2805577 2974710
chr12_GL877876v1_alt    3736839 3820171
chr12_KI270904v1_alt    6924313 7067502
chr13_KI270840v1_alt    3127654 3295739
chr13_KI270841v1_alt    8680841 9033557
chr13_KI270841v1_alt    9033828 9276044
chr15_KI270849v1_alt    8948327 9336376
chr15_KI270850v1_alt    6358541 6822565
chr16   3704062 4122526
chr16_KI270854v1_alt    7278604 7297845
chr16_KI270856v1_alt    5855373 6089898
chr17_GL383563v3_alt    7476487 7845527
chr17_JH159148v1_alt    3708868 3984631
chr17_JH159148v1_alt    7040113 7042904
chr17_KI270730v1_random 404268  854145
chr17_KI270730v1_random 3544067 3796807
chr17_KI270859v1_alt    3996864 4344945
chr17_KI270907v1_alt    1735306 2201566
chr18   1856815 2144546
chr18_KI270863v1_alt    3984764 4166006
chr19_GL383573v1_alt    7863497 7896279
chr19_GL383576v1_alt    4363702 4753141
chr19_GL949746v1_alt    8572018 8832793
chr19_GL949747v2_alt    1578276 2033118
chr19_GL949748v2_alt    3321575 3743545
chr19_KI270884v1_alt    2760167 3027068
chr19_KI270888v1_alt    3809944 3824324
chr19_KI270888v1_alt    9683785 10166473
chr19_KI270889v1_alt    8709969 8823256
chr19_KI270891v1_alt    7485890 7719006
chr19_KI270917v1_alt    7816269 7864474
chr19_KI270917v1_alt    9443280 9678387
chr19_KI270920v1_alt    5554949 5919294
chr19_KI270931v1_alt    1794178 2143519
chr19_KI270931v1_alt    6356216 6811953
chr19_KI270932v1_alt    8320855 8486835
chr19_KI270933v1_alt    1600444 2095219
chr20_GL383577v2_alt    3002112 3480396
chr20_GL383577v2_alt    3031797 3421627
chr20_KI270869v1_alt    2269342 2353172
chr22_KB663609v1_alt    4944645 4971886
chr22_KI270732v1_random 9187626 9418154
chrUn_GL000214v1        2468418 2728693
chrUn_GL000220v1        7612088 7825236
chrUn_KI270310v1        4229845 4626802
chrUn_KI270311v1        963681  1069745
chrUn_KI270322v1        1563181 1629375
chrUn_KI270322v1        9392841 9512920
chrUn_KI270335v1        7360266 7748998
chrUn_KI270366v1        3332191 3576201
chrUn_KI270378v1        8848225 9158581
chrUn_KI270387v1        5189812 5439563
chrUn_KI270387v1        7430161 7648856
chrUn_KI270419v1        5482364 5893900
chrUn_KI270438v1        8523562 8944980
chrUn_KI270507v1        599827  666674
chrUn_KI270517v1        3377374 3454859
chrUn_KI270521v1        6924036 7001552
chrUn_KI270544v1        7025954 7392905
chrUn_KI270582v1        6390491 6398266
chrUn_KI270589v1        5789405 6182867
chrUn_KI270591v1        9457975 9812609
chrUn_KI270741v1        1292450 1371680
chrX    4490524 4892623

https://www.gnu.org/software/gawk/manual/html_node/Controlling-Array-Traversal.html

A GNU awk approach that splits the input into a numeric array and a string array, then traverses each with a different sort order:

awk '{
   v = gensub(/^[^0-9]+/, "", "1", $1)
   if (v ~ /[0-9]/)
      narr[v,$2] = $0
   else
      sarr[v,$2] = $0
}
END {
   PROCINFO["sorted_in"]="@ind_num_asc"
   for (i in narr) print narr[i]
   PROCINFO["sorted_in"]="@ind_str_asc"
   for (i in sarr) print sarr[i]
}' file

Output:

chr1_GL383520v2_alt     3718510 3948906
chr1_KI270708v1_random  9240458 9393655
chr1_KI270710v1_random  8843336 9304602
chr1_KI270711v1_random  6111532 6446935
chr1_KI270713v1_random  7260088 7358876
chr1_KI270762v1_alt     4196600 4684821
chr1_KI270763v1_alt     1518275 1527443
chr2_KI270769v1_alt     8706329 9147304
chr2_KI270773v1_alt     8604675 8922311
chr3    3097856 3436204
chr3_KI270778v1_alt     2228100 2549359
chr3_KI270937v1_alt     9902343 10113942
chr4_GL383528v1_alt     2099770 2280753
chr4_KI270789v1_alt     3309756 3669565
chr5_GL339449v2_alt     7220557 7718111
chr5_GL383530v1_alt     109389  128439
chr5_GL383530v1_alt     2139701 2285854
chr5_KI270793v1_alt     9363008 9816819
chr5_KI270897v1_alt     324409  548890
chr7_KI270806v1_alt     9032201 9471652
chr8_KI270810v1_alt     8499569 8873953
chr8_KI270812v1_alt     6146462 6388606
chr8_KI270819v1_alt     7048265 7503415
chr8_KI270820v1_alt     1008865 1426444
chr9    2052238 2476827
chr9_GL383539v1_alt     3394859 3499461
chr9_GL383541v1_alt     3771485 3927905
chr9_KI270720v1_random  8948742 9157627
chr10   2656835 2873499
chr11_JH159136v1_alt    8171978 8319851
chr11_JH159137v1_alt    6090599 6480896
chr11_KI270831v1_alt    2112401 2490081
chr11_KI270831v1_alt    8603751 9028904
chr11_KI270927v1_alt    2119470 2348459
chr12   6624798 7106478
chr12_GL877876v1_alt    2805577 2974710
chr12_GL877876v1_alt    3736839 3820171
chr12_KI270904v1_alt    6924313 7067502
chr13_KI270840v1_alt    3127654 3295739
chr13_KI270841v1_alt    8680841 9033557
chr13_KI270841v1_alt    9033828 9276044
chr15_KI270849v1_alt    8948327 9336376
chr15_KI270850v1_alt    6358541 6822565
chr16   3704062 4122526
chr16_KI270854v1_alt    7278604 7297845
chr16_KI270856v1_alt    5855373 6089898
chr17_GL383563v3_alt    7476487 7845527
chr17_JH159148v1_alt    3708868 3984631
chr17_JH159148v1_alt    7040113 7042904
chr17_KI270730v1_random 3544067 3796807
chr17_KI270730v1_random 404268  854145
chr17_KI270859v1_alt    3996864 4344945
chr17_KI270907v1_alt    1735306 2201566
chr18   1856815 2144546
chr18_KI270863v1_alt    3984764 4166006
chr19_GL383573v1_alt    7863497 7896279
chr19_GL383576v1_alt    4363702 4753141
chr19_GL949746v1_alt    8572018 8832793
chr19_GL949747v2_alt    1578276 2033118
chr19_GL949748v2_alt    3321575 3743545
chr19_KI270884v1_alt    2760167 3027068
chr19_KI270888v1_alt    3809944 3824324
chr19_KI270888v1_alt    9683785 10166473
chr19_KI270889v1_alt    8709969 8823256
chr19_KI270891v1_alt    7485890 7719006
chr19_KI270917v1_alt    7816269 7864474
chr19_KI270917v1_alt    9443280 9678387
chr19_KI270920v1_alt    5554949 5919294
chr19_KI270931v1_alt    1794178 2143519
chr19_KI270931v1_alt    6356216 6811953
chr19_KI270932v1_alt    8320855 8486835
chr19_KI270933v1_alt    1600444 2095219
chr20_GL383577v2_alt    3002112 3480396
chr20_GL383577v2_alt    3031797 3421627
chr20_KI270869v1_alt    2269342 2353172
chr22_KB663609v1_alt    4944645 4971886
chr22_KI270732v1_random 9187626 9418154
chrUn_GL000214v1        2468418 2728693
chrUn_GL000220v1        7612088 7825236
chrUn_KI270310v1        4229845 4626802
chrUn_KI270311v1        963681  1069745
chrUn_KI270322v1        1563181 1629375
chrUn_KI270322v1        9392841 9512920
chrUn_KI270335v1        7360266 7748998
chrUn_KI270366v1        3332191 3576201
chrUn_KI270378v1        8848225 9158581
chrUn_KI270387v1        5189812 5439563
chrUn_KI270387v1        7430161 7648856
chrUn_KI270419v1        5482364 5893900
chrUn_KI270438v1        8523562 8944980
chrUn_KI270507v1        599827  666674
chrUn_KI270517v1        3377374 3454859
chrUn_KI270521v1        6924036 7001552
chrUn_KI270544v1        7025954 7392905
chrUn_KI270582v1        6390491 6398266
chrUn_KI270589v1        5789405 6182867
chrUn_KI270591v1        9457975 9812609
chrUn_KI270741v1        1292450 1371680
chrX    4490524 4892623

If you don't mind letters sorting before numbers in the numeric sort (or can find a locale where that is not the case), this may be all you need:

$ cat tst.awk
{ vals[gensub(/chr/,"",1,$1)][$2][$3] }
END {
    OFS = "\t"

    PROCINFO["sorted_in"] = "@ind_num_asc"
    for (chr in vals) {
        for (beg in vals[chr]) {
            for (end in vals[chr][beg]) {
                print "chr"chr, beg, end
            }
        }
    }
}

$ awk -f tst.awk file
chrX    6234    79593
chrY    204     310
chr1    1       100
chr1    5291    5500
chr2    33      52
chr2    555     1200
chr3    3       200
chr5    11      66
chr7    44      222
chr7    300     444
chr10   1       500
chr11   444     2141
chr16   224     5521
chr21   24      6023
chr22   99      1052

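The above uses GNU awk for arrays of arrays and sorted_in.

Assumptions:

  • the first field always starts with at least one letter, followed by 0 or more digits (i.e. [:alpha:]+[:digit:]*)
  • letters and digits are never interleaved (e.g. d23FgE55 will not occur)
  • the first field consists only of letters and digits (i.e. [:alnum:])
  • the entire input file fits in roughly 60% of available memory (otherwise loading the whole file into awk arrays may fail with an 'out of memory' error)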
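
These assumptions about the first field can be checked before sorting. The following is only a minimal sketch, assuming a POSIX awk; the function name check_first_field is made up for illustration, and the ranges [A-Za-z] and [0-9] stand in for [:alpha:] and [:digit:] so that it also runs on awk versions without POSIX character classes.

```shell
# Hypothetical helper (not part of the answer above):
# print "line number: first field" for every row whose first field
# is not "one or more letters followed by zero or more digits".
check_first_field() {
    awk '$1 !~ /^[A-Za-z]+[0-9]*$/ { print FNR ": " $1 }' "$@"
}

# Rows 3 and 4 break the assumed shape, rows 1 and 2 do not.
printf 'chr1 1 100\nchrX 5 9\nd23FgE55 1 2\nchr1_KI270708v1_random 3 4\n' | check_first_field
```

If this prints nothing for a file, the first three assumptions hold for it. Keys such as chr1_KI270708v1_random from the second example in the question are reported, so they fall outside what this answer is designed for.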

Adding a few more data points to the input:

$ cat dat.raw
chr1    1    100
chr10   1    500
chr2    33   52
bec9   3   17
chr5    11   66
chr22   99   1052
chr11   444  2141
chr2    555  1200
chr7    300   444
defA   3    15
def7   13    15
def7   3    15
chr7    44   222
chr21   24   6023
chr16   224  5521
chr3    3    200
chrX    6234    79593
chr1    5291    5500
chrY    204  310

Using GNU awk for arrays of arrays and PROCINFO["sorted_in"]:

awk '
     { keyL=gensub(/([[:digit:]])/,"","g",$1)    # strip digits from 1st field
       keyN=gensub(/([[:alpha:]])/,"","g",$1)    # strip letters from 1st field

       if ( keyN == "" ) keyN=999999999          # if no digits in 1st field use a really big number to ensure these rows go to the end of the list

       arr[keyL][keyN][$2]=$0                    # arr[] index = 1st field letters / 1st field numbers / 2nd field
     }

END  { PROCINFO["sorted_in"]="@ind_str_asc"      # sort 1st index as string
       for (i in arr) {
           PROCINFO["sorted_in"]="@ind_num_asc"  # sort 2nd/3rd indices as numbers
           for (j in arr[i])
               for (k in arr[i][j])
                   print arr[i][j][k]
       }
     }
' dat.raw

This generates:

bec9   3   17
chr1    1    100
chr1    5291    5500
chr2    33   52
chr2    555  1200
chr3    3    200
chr5    11   66
chr7    44   222
chr7    300   444
chr10   1    500
chr11   444  2141
chr16   224  5521
chr21   24   6023
chr22   99   1052
chrX    6234    79593
chrY    204  310
def7   3    15
def7   13    15
defA   3    15

The version-sort rules of the GNU utilities are not easy to implement in full.
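
When the full rules are not needed, a common approximation is to zero-pad every run of digits in the key and then sort the padded keys as plain strings. The sketch below is a different technique from sort -V and from the answers on this page. It assumes whitespace-separated fields with no tab characters, digit runs of at most 12 characters, and a sort on the first field and then the second. The names natural_sort and pad are made up for illustration.

```shell
# Hypothetical helper: approximate a natural sort by zero-padding digit runs
# (decorate, sort, undecorate). Not a full implementation of sort -V.
natural_sort() {
    awk '
    function pad(d) {                      # left-pad a digit string to 12 characters
        return substr("000000000000", 1, 12 - length(d)) d
    }
    {
        s = $1; key = ""
        while (match(s, /[0-9]+/)) {       # copy text unchanged, pad each digit run
            key = key substr(s, 1, RSTART - 1) pad(substr(s, RSTART, RLENGTH))
            s = substr(s, RSTART + RLENGTH)
        }
        # padded key, padded 2nd field, then the untouched original line
        print key s "\t" pad($2) "\t" $0
    }' "$@" | LC_ALL=C sort | cut -f3-     # byte-wise sort, then drop the two helper columns
}

printf 'chr10 1 500\nchr2 555 1200\nchrX 6234 79593\nchr2 33 52\nchr1 5291 5500\nchr1 1 100\n' | natural_sort
```

On this sample the chr1 rows come first, then chr2, chr10 and chrX, with rows of the same chromosome ordered by their second field. Because the comparison is byte-wise, digits sort before letters and a bare chr3 sorts before chr3_..., which happens to match the desired output in the question but is not guaranteed to match sort -V on other inputs.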

Given your second example, saved in the files unsorted and desired:

head -n 5 unsorted
chr1_KI270708v1_random  9240458 9393655
chrUn_KI270589v1        5789405 6182867
chr19_GL383576v1_alt    4363702 4753141
chr8_KI270820v1_alt     1008865 1426444
chrUn_GL000220v1        7612088 7825236
...

head -n 5 desired
chr1_GL383520v2_alt     3718510 3948906
chr1_KI270708v1_random  9240458 9393655
chr1_KI270710v1_random  8843336 9304602
chr1_KI270711v1_random  6111532 6446935
chr1_KI270713v1_random  7260088 7358876
...

Here is a Ruby version. It works completely for your second example:

ruby -e 'ar=$<.read.split(/\r?\n/); 
puts ar.sort_by {|e| e.split(/(\d+)/).map {|a| a =~ /\d+/ ? a.to_i : a }}.
    join("\n")' unsorted >t_ruby

Now compare t_ruby and desired:

awk 'FNR==NR{idx[$0]=FNR; next}
{print $0, FNR, idx[$0]}
' t_ruby desired | head -n 10
chr1_GL383520v2_alt     3718510 3948906 1 1
chr1_KI270708v1_random  9240458 9393655 2 2
chr1_KI270710v1_random  8843336 9304602 3 3
chr1_KI270711v1_random  6111532 6446935 4 4
chr1_KI270713v1_random  7260088 7358876 5 5
chr1_KI270762v1_alt     4196600 4684821 6 6
chr1_KI270763v1_alt     1518275 1527443 7 7
chr2_KI270769v1_alt     8706329 9147304 8 8
chr2_KI270773v1_alt     8604675 8922311 9 9
chr3    3097856 3436204 10 10
... line by line the same ...

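This is what I was looking for when I posted the question. I'm sure the algorithm could be sped up and shortened a lot. It splits each string into numeric and non-numeric sections and compares them section by section. I believe it behaves the same as GNU sort -V, except for how it handles the "." character.

I like the other answers because they run fast, but I think this function should work for arbitrarily long and odd alphanumeric groups.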
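
To make the section-by-section idea concrete, here is a small sketch that only prints the sections of each key, without comparing anything. It is a POSIX awk illustration built on match(), not the gensub-based function that follows, and show_chunks is a hypothetical name.

```shell
# Hypothetical helper: show how a key breaks into text and digit sections.
show_chunks() {
    awk '{
        s = $1; out = ""
        while (match(s, /[0-9]+/)) {
            if (RSTART > 1)                              # text before the digit run
                out = out "[" substr(s, 1, RSTART - 1) "]"
            out = out "[" substr(s, RSTART, RLENGTH) "]" # the digit run itself
            s = substr(s, RSTART + RLENGTH)              # continue after it
        }
        if (s != "") out = out "[" s "]"                 # trailing text, if any
        print out
    }' "$@"
}

printf 'chr19_KI270888v1_alt\nchrX\n' | show_chunks
# prints:
# [chr][19][_KI][270888][v][1][_alt]
# [chrX]
```

Two keys are then ordered by walking these sections in parallel: digit sections are compared as numbers, text sections as strings, and the first difference decides.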

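# Comparison function for PROCINFO["sorted_in"] (GNU awk only).
# The parameters after v2 are local variables.
function ivsort(i1, v1, i2, v2,    s, f1, f2, val) {
    s = sprintf("%c", 0)                              # NUL byte used as a section separator
    split(gensub(/[0-9]+/, s "&" s, "g", i1), f1, s)  # wrap each digit run in separators, then split
    split(gensub(/[0-9]+/, s "&" s, "g", i2), f2, s)
    for (val = 1; val <= length(f1); val++) {
        if (!(val in f2)) {                           # i2 has no section at this position
            return -1
        }
        if (f1[val] == f2[val]) {                     # identical sections: look at the next one
            continue
        }
        if (f1[val]+0 != f1[val] && f2[val]+0 != f2[val]) {
            return (f1[val] > f2[val]) ? 1 : -1       # both sections are text: string comparison
        }
        else {
            return (f1[val]+0) - (f2[val]+0)          # otherwise compare as numbers
        }
    }
    if (length(f2) > length(f1))
        return -1
    else
        return length(i1) - length(i2)
}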

To run it, add the function to the top of your awk program and set its name as the value of PROCINFO["sorted_in"]:

PROCINFO["sorted_in"] = "ivsort"
for (x in a) {
    print x
}