awk数组遍历中的字母数字或"Version sort"
alphanumeric or "Version sort" in awk array traversal
我使用的生物信息学文件在一行中包含字符和数字的组合,如下所示
编辑更新示例
chr1 1 100
chr10 1 500
chr2 33 52
chr5 11 66
chr22 99 1052
chr11 444 2141
chr2 555 1200
chr7 300 444
chr7 44 222
chr21 24 6023
chr16 224 5521
chr3 3 200
chrX 6234 79593
chr1 5291 5500
chrY 204 310
我想这样排序
chr1 1 100
chr1 5291 5500
chr2 33 52
chr2 555 1200
chr3 3 200
chr5 11 66
chr7 44 222
chr7 300 444
chr10 1 500
chr11 444 2141
chr16 224 5521
chr21 24 6023
chr22 99 1052
chrX 6234 79593
chrY 204 310
我正在使用受控数组横向对它们进行排序,因为我使用 for 循环扫描它们,但已建立的排序方法是数字、字符串或类型排序。我想要与 GNU sort -V
排序完全一样的东西。在下面的 link 中,我看到我可以构建一个用于排序的自定义函数,但我不确定如何对字母数字值执行此操作。
我想做这样的事情,分别对字母值和数字部分进行排序,然后想办法将它们组合起来,但我不确定具体该怎么做。
echo "d23FgE55" | awk '{alph=;num=;gsub("[0-9]+","-",alph);gsub("[A-z]+","-",num);print alph,num}'
d-FgE- -23-55
编辑
抱歉,其中一个答案让我想起了我 运行 有时遇到的一些更复杂的例子,这让我问了一些更接近 sort -V
行为的问题。
未排序示例
chr1_KI270708v1_random 9240458 9393655
chrUn_KI270589v1 5789405 6182867
chr19_GL383576v1_alt 4363702 4753141
chr8_KI270820v1_alt 1008865 1426444
chrUn_GL000220v1 7612088 7825236
chrUn_KI270591v1 9457975 9812609
chr19_GL949747v2_alt 1578276 2033118
chr13_KI270841v1_alt 8680841 9033557
chr17_KI270859v1_alt 3996864 4344945
chr20_GL383577v2_alt 3002112 3480396
chrUn_KI270322v1 1563181 1629375
chrUn_KI270419v1 5482364 5893900
chrUn_KI270310v1 4229845 4626802
chr17_KI270907v1_alt 1735306 2201566
chr9 2052238 2476827
chr1_KI270713v1_random 7260088 7358876
chr19_KI270891v1_alt 7485890 7719006
chr19_KI270917v1_alt 7816269 7864474
chrUn_KI270378v1 8848225 9158581
chr19_KI270933v1_alt 1600444 2095219
chr3_KI270937v1_alt 9902343 10113942
chr12_KI270904v1_alt 6924313 7067502
chrUn_GL000214v1 2468418 2728693
chr19_KI270884v1_alt 2760167 3027068
chrUn_KI270582v1 6390491 6398266
chr5_KI270897v1_alt 324409 548890
chr1_GL383520v2_alt 3718510 3948906
chr19_KI270888v1_alt 3809944 3824324
chr11_JH159137v1_alt 6090599 6480896
chr1_KI270762v1_alt 4196600 4684821
chr11_KI270831v1_alt 2112401 2490081
chr5_GL339449v2_alt 7220557 7718111
chr19_KI270889v1_alt 8709969 8823256
chr19_KI270931v1_alt 6356216 6811953
chr5_GL383530v1_alt 109389 128439
chr11_KI270927v1_alt 2119470 2348459
chr17_KI270730v1_random 404268 854145
chrUn_KI270387v1 7430161 7648856
chr10 2656835 2873499
chr19_GL383573v1_alt 7863497 7896279
chrUn_KI270741v1 1292450 1371680
chrUn_KI270335v1 7360266 7748998
chr9_GL383539v1_alt 3394859 3499461
chrX 4490524 4892623
chrUn_KI270311v1 963681 1069745
chr11_JH159136v1_alt 8171978 8319851
chr17_JH159148v1_alt 3708868 3984631
chrUn_KI270544v1 7025954 7392905
chr19_KI270888v1_alt 9683785 10166473
chrUn_KI270521v1 6924036 7001552
chr1_KI270710v1_random 8843336 9304602
chr19_GL949746v1_alt 8572018 8832793
chrUn_KI270322v1 9392841 9512920
chrUn_KI270366v1 3332191 3576201
chr13_KI270841v1_alt 9033828 9276044
chr19_GL949748v2_alt 3321575 3743545
chr16 3704062 4122526
chr17_GL383563v3_alt 7476487 7845527
chr8_KI270810v1_alt 8499569 8873953
chr22_KI270732v1_random 9187626 9418154
chr20_GL383577v2_alt 3031797 3421627
chr9_GL383541v1_alt 3771485 3927905
chr9_KI270720v1_random 8948742 9157627
chr16_KI270854v1_alt 7278604 7297845
chr1_KI270763v1_alt 1518275 1527443
chrUn_KI270517v1 3377374 3454859
chr15_KI270850v1_alt 6358541 6822565
chr22_KB663609v1_alt 4944645 4971886
chr7_KI270806v1_alt 9032201 9471652
chrUn_KI270438v1 8523562 8944980
chr17_KI270730v1_random 3544067 3796807
chr18 1856815 2144546
chr20_KI270869v1_alt 2269342 2353172
chr5_GL383530v1_alt 2139701 2285854
chr8_KI270819v1_alt 7048265 7503415
chr17_JH159148v1_alt 7040113 7042904
chr5_KI270793v1_alt 9363008 9816819
chr19_KI270931v1_alt 1794178 2143519
chr3_KI270778v1_alt 2228100 2549359
chr19_KI270932v1_alt 8320855 8486835
chr12_GL877876v1_alt 3736839 3820171
chr12_GL877876v1_alt 2805577 2974710
chr4_KI270789v1_alt 3309756 3669565
chr19_KI270917v1_alt 9443280 9678387
chr11_KI270831v1_alt 8603751 9028904
chrUn_KI270387v1 5189812 5439563
chrUn_KI270507v1 599827 666674
chr1_KI270711v1_random 6111532 6446935
chr2_KI270773v1_alt 8604675 8922311
chr16_KI270856v1_alt 5855373 6089898
chr3 3097856 3436204
chr13_KI270840v1_alt 3127654 3295739
chr15_KI270849v1_alt 8948327 9336376
chr18_KI270863v1_alt 3984764 4166006
chr19_KI270920v1_alt 5554949 5919294
chr12 6624798 7106478
chr4_GL383528v1_alt 2099770 2280753
chr2_KI270769v1_alt 8706329 9147304
chr8_KI270812v1_alt 6146462 6388606
排序输出
chr1_GL383520v2_alt 3718510 3948906
chr1_KI270708v1_random 9240458 9393655
chr1_KI270710v1_random 8843336 9304602
chr1_KI270711v1_random 6111532 6446935
chr1_KI270713v1_random 7260088 7358876
chr1_KI270762v1_alt 4196600 4684821
chr1_KI270763v1_alt 1518275 1527443
chr2_KI270769v1_alt 8706329 9147304
chr2_KI270773v1_alt 8604675 8922311
chr3 3097856 3436204
chr3_KI270778v1_alt 2228100 2549359
chr3_KI270937v1_alt 9902343 10113942
chr4_GL383528v1_alt 2099770 2280753
chr4_KI270789v1_alt 3309756 3669565
chr5_GL339449v2_alt 7220557 7718111
chr5_GL383530v1_alt 109389 128439
chr5_GL383530v1_alt 2139701 2285854
chr5_KI270793v1_alt 9363008 9816819
chr5_KI270897v1_alt 324409 548890
chr7_KI270806v1_alt 9032201 9471652
chr8_KI270810v1_alt 8499569 8873953
chr8_KI270812v1_alt 6146462 6388606
chr8_KI270819v1_alt 7048265 7503415
chr8_KI270820v1_alt 1008865 1426444
chr9 2052238 2476827
chr9_GL383539v1_alt 3394859 3499461
chr9_GL383541v1_alt 3771485 3927905
chr9_KI270720v1_random 8948742 9157627
chr10 2656835 2873499
chr11_JH159136v1_alt 8171978 8319851
chr11_JH159137v1_alt 6090599 6480896
chr11_KI270831v1_alt 2112401 2490081
chr11_KI270831v1_alt 8603751 9028904
chr11_KI270927v1_alt 2119470 2348459
chr12 6624798 7106478
chr12_GL877876v1_alt 2805577 2974710
chr12_GL877876v1_alt 3736839 3820171
chr12_KI270904v1_alt 6924313 7067502
chr13_KI270840v1_alt 3127654 3295739
chr13_KI270841v1_alt 8680841 9033557
chr13_KI270841v1_alt 9033828 9276044
chr15_KI270849v1_alt 8948327 9336376
chr15_KI270850v1_alt 6358541 6822565
chr16 3704062 4122526
chr16_KI270854v1_alt 7278604 7297845
chr16_KI270856v1_alt 5855373 6089898
chr17_GL383563v3_alt 7476487 7845527
chr17_JH159148v1_alt 3708868 3984631
chr17_JH159148v1_alt 7040113 7042904
chr17_KI270730v1_random 404268 854145
chr17_KI270730v1_random 3544067 3796807
chr17_KI270859v1_alt 3996864 4344945
chr17_KI270907v1_alt 1735306 2201566
chr18 1856815 2144546
chr18_KI270863v1_alt 3984764 4166006
chr19_GL383573v1_alt 7863497 7896279
chr19_GL383576v1_alt 4363702 4753141
chr19_GL949746v1_alt 8572018 8832793
chr19_GL949747v2_alt 1578276 2033118
chr19_GL949748v2_alt 3321575 3743545
chr19_KI270884v1_alt 2760167 3027068
chr19_KI270888v1_alt 3809944 3824324
chr19_KI270888v1_alt 9683785 10166473
chr19_KI270889v1_alt 8709969 8823256
chr19_KI270891v1_alt 7485890 7719006
chr19_KI270917v1_alt 7816269 7864474
chr19_KI270917v1_alt 9443280 9678387
chr19_KI270920v1_alt 5554949 5919294
chr19_KI270931v1_alt 1794178 2143519
chr19_KI270931v1_alt 6356216 6811953
chr19_KI270932v1_alt 8320855 8486835
chr19_KI270933v1_alt 1600444 2095219
chr20_GL383577v2_alt 3002112 3480396
chr20_GL383577v2_alt 3031797 3421627
chr20_KI270869v1_alt 2269342 2353172
chr22_KB663609v1_alt 4944645 4971886
chr22_KI270732v1_random 9187626 9418154
chrUn_GL000214v1 2468418 2728693
chrUn_GL000220v1 7612088 7825236
chrUn_KI270310v1 4229845 4626802
chrUn_KI270311v1 963681 1069745
chrUn_KI270322v1 1563181 1629375
chrUn_KI270322v1 9392841 9512920
chrUn_KI270335v1 7360266 7748998
chrUn_KI270366v1 3332191 3576201
chrUn_KI270378v1 8848225 9158581
chrUn_KI270387v1 5189812 5439563
chrUn_KI270387v1 7430161 7648856
chrUn_KI270419v1 5482364 5893900
chrUn_KI270438v1 8523562 8944980
chrUn_KI270507v1 599827 666674
chrUn_KI270517v1 3377374 3454859
chrUn_KI270521v1 6924036 7001552
chrUn_KI270544v1 7025954 7392905
chrUn_KI270582v1 6390491 6398266
chrUn_KI270589v1 5789405 6182867
chrUn_KI270591v1 9457975 9812609
chrUn_KI270741v1 1292450 1371680
chrX 4490524 4892623
https://www.gnu.org/software/gawk/manual/html_node/Controlling-Array-Traversal.html
一种 gnu-awk
方法,将输入拆分为数字和字符串数组,然后使用 2 种不同的排序类型:
awk '{
v = gensub(/^[^0-9]+/, "", "1", )
if (v ~ /[0-9]/)
narr[v,] = [=10=]
else
sarr[v,] = [=10=]
}
END {
PROCINFO["sorted_in"]="@ind_num_asc"
for (i in narr) print narr[i]
PROCINFO["sorted_in"]="@ind_str_asc"
for (i in sarr) print sarr[i]
}' file
输出:
chr1_GL383520v2_alt 3718510 3948906
chr1_KI270708v1_random 9240458 9393655
chr1_KI270710v1_random 8843336 9304602
chr1_KI270711v1_random 6111532 6446935
chr1_KI270713v1_random 7260088 7358876
chr1_KI270762v1_alt 4196600 4684821
chr1_KI270763v1_alt 1518275 1527443
chr2_KI270769v1_alt 8706329 9147304
chr2_KI270773v1_alt 8604675 8922311
chr3 3097856 3436204
chr3_KI270778v1_alt 2228100 2549359
chr3_KI270937v1_alt 9902343 10113942
chr4_GL383528v1_alt 2099770 2280753
chr4_KI270789v1_alt 3309756 3669565
chr5_GL339449v2_alt 7220557 7718111
chr5_GL383530v1_alt 109389 128439
chr5_GL383530v1_alt 2139701 2285854
chr5_KI270793v1_alt 9363008 9816819
chr5_KI270897v1_alt 324409 548890
chr7_KI270806v1_alt 9032201 9471652
chr8_KI270810v1_alt 8499569 8873953
chr8_KI270812v1_alt 6146462 6388606
chr8_KI270819v1_alt 7048265 7503415
chr8_KI270820v1_alt 1008865 1426444
chr9 2052238 2476827
chr9_GL383539v1_alt 3394859 3499461
chr9_GL383541v1_alt 3771485 3927905
chr9_KI270720v1_random 8948742 9157627
chr10 2656835 2873499
chr11_JH159136v1_alt 8171978 8319851
chr11_JH159137v1_alt 6090599 6480896
chr11_KI270831v1_alt 2112401 2490081
chr11_KI270831v1_alt 8603751 9028904
chr11_KI270927v1_alt 2119470 2348459
chr12 6624798 7106478
chr12_GL877876v1_alt 2805577 2974710
chr12_GL877876v1_alt 3736839 3820171
chr12_KI270904v1_alt 6924313 7067502
chr13_KI270840v1_alt 3127654 3295739
chr13_KI270841v1_alt 8680841 9033557
chr13_KI270841v1_alt 9033828 9276044
chr15_KI270849v1_alt 8948327 9336376
chr15_KI270850v1_alt 6358541 6822565
chr16 3704062 4122526
chr16_KI270854v1_alt 7278604 7297845
chr16_KI270856v1_alt 5855373 6089898
chr17_GL383563v3_alt 7476487 7845527
chr17_JH159148v1_alt 3708868 3984631
chr17_JH159148v1_alt 7040113 7042904
chr17_KI270730v1_random 3544067 3796807
chr17_KI270730v1_random 404268 854145
chr17_KI270859v1_alt 3996864 4344945
chr17_KI270907v1_alt 1735306 2201566
chr18 1856815 2144546
chr18_KI270863v1_alt 3984764 4166006
chr19_GL383573v1_alt 7863497 7896279
chr19_GL383576v1_alt 4363702 4753141
chr19_GL949746v1_alt 8572018 8832793
chr19_GL949747v2_alt 1578276 2033118
chr19_GL949748v2_alt 3321575 3743545
chr19_KI270884v1_alt 2760167 3027068
chr19_KI270888v1_alt 3809944 3824324
chr19_KI270888v1_alt 9683785 10166473
chr19_KI270889v1_alt 8709969 8823256
chr19_KI270891v1_alt 7485890 7719006
chr19_KI270917v1_alt 7816269 7864474
chr19_KI270917v1_alt 9443280 9678387
chr19_KI270920v1_alt 5554949 5919294
chr19_KI270931v1_alt 1794178 2143519
chr19_KI270931v1_alt 6356216 6811953
chr19_KI270932v1_alt 8320855 8486835
chr19_KI270933v1_alt 1600444 2095219
chr20_GL383577v2_alt 3002112 3480396
chr20_GL383577v2_alt 3031797 3421627
chr20_KI270869v1_alt 2269342 2353172
chr22_KB663609v1_alt 4944645 4971886
chr22_KI270732v1_random 9187626 9418154
chrUn_GL000214v1 2468418 2728693
chrUn_GL000220v1 7612088 7825236
chrUn_KI270310v1 4229845 4626802
chrUn_KI270311v1 963681 1069745
chrUn_KI270322v1 1563181 1629375
chrUn_KI270322v1 9392841 9512920
chrUn_KI270335v1 7360266 7748998
chrUn_KI270366v1 3332191 3576201
chrUn_KI270378v1 8848225 9158581
chrUn_KI270387v1 5189812 5439563
chrUn_KI270387v1 7430161 7648856
chrUn_KI270419v1 5482364 5893900
chrUn_KI270438v1 8523562 8944980
chrUn_KI270507v1 599827 666674
chrUn_KI270517v1 3377374 3454859
chrUn_KI270521v1 6924036 7001552
chrUn_KI270544v1 7025954 7392905
chrUn_KI270582v1 6390491 6398266
chrUn_KI270589v1 5789405 6182867
chrUn_KI270591v1 9457975 9812609
chrUn_KI270741v1 1292450 1371680
chrX 4490524 4892623
如果您不介意在数字排序中字母排在数字之前(或者可以找到一个语言环境,但情况并非如此),这可能就是您所需要的:
$ cat tst.awk
{ vals[gensub(/chr/,"",1,)][][] }
END {
OFS = "\t"
PROCINFO["sorted_in"] = "@ind_num_asc"
for (chr in vals) {
for (beg in vals[chr]) {
for (end in vals[chr][beg]) {
print "chr"chr, beg, end
}
}
}
}
$ awk -f tst.awk file
chrX 6234 79593
chrY 204 310
chr1 1 100
chr1 5291 5500
chr2 33 52
chr2 555 1200
chr3 3 200
chr5 11 66
chr7 44 222
chr7 300 444
chr10 1 500
chr11 444 2141
chr16 224 5521
chr21 24 6023
chr22 99 1052
以上使用 GNU awk 作为数组的数组,sorted_in。
假设:
- 第一个字段将始终至少以一个字母开头,后跟 0 个或多个数字(即
[:alpha:]+[:digit:]*
)
- 字母和数字永远不会混合(例如,
d23FgE55
不会出现)
- 第一个字段仅由字母和数字组成(即
[:alnum:]
)
- 整个输入文件适合约 60% 的可用内存(否则将整个文件加载到
awk
数组可能会出现 'out of memory' 错误)
向输入添加更多数据点:
$ cat dat.raw
chr1 1 100
chr10 1 500
chr2 33 52
bec9 3 17
chr5 11 66
chr22 99 1052
chr11 444 2141
chr2 555 1200
chr7 300 444
defA 3 15
def7 13 15
def7 3 15
chr7 44 222
chr21 24 6023
chr16 224 5521
chr3 3 200
chrX 6234 79593
chr1 5291 5500
chrY 204 310
对数组的数组使用 GNU awk
和 PROCINFO["sorted_in"]
:
awk '
{ keyL=gensub(/([[:digit:]])/,"","g",) # strip numbers from 1st field
keyN=gensub(/([[:alpha:]])/,"","g",) # strip letters from 1st field
if ( keyN == "" ) keyN=999999999 # if no numbers in 1st field use a really big number to insure these rows go to the end of the list
arr[keyL][keyN][]=[=11=] # arr[] index = 1st field letters / 1st field numbers / 2nd field
}
END { PROCINFO["sorted_in"]="@ind_str_asc" # sort 1st index as string
for (i in arr) {
PROCINFO["sorted_in"]="@ind_num_asc" # sort 2nd/3rd indices as numbers
for (j in arr[i])
for (k in arr[i][j])
print arr[i][j][k]
}
}
' dat.raw
这会生成:
bec9 3 17
chr1 1 100
chr1 5291 5500
chr2 33 52
chr2 555 1200
chr3 3 200
chr5 11 66
chr7 44 222
chr7 300 444
chr10 1 500
chr11 444 2141
chr16 224 5521
chr21 24 6023
chr22 99 1052
chrX 6234 79593
chrY 204 310
def7 3 15
def7 13 15
defA 3 15
GNU 实用程序的版本排序 rules 完全实现起来并不容易。
鉴于你的第二个例子 unsorted
和 desired
:
head -n 5 unsorted
chr1_KI270708v1_random 9240458 9393655
chrUn_KI270589v1 5789405 6182867
chr19_GL383576v1_alt 4363702 4753141
chr8_KI270820v1_alt 1008865 1426444
chrUn_GL000220v1 7612088 7825236
...
head -n 5 desired
chr1_GL383520v2_alt 3718510 3948906
chr1_KI270708v1_random 9240458 9393655
chr1_KI270710v1_random 8843336 9304602
chr1_KI270711v1_random 6111532 6446935
chr1_KI270713v1_random 7260088 7358876
...
这里是ruby。它完全适用于您的第二个示例:
ruby -e 'ar=$<.read.split(/\r?\n/);
puts ar.sort_by {|e| e.split(/(\d+)/).map {|a| a =~ /\d+/ ? a.to_i : a }}.
join("\n")' unsorted >t_ruby
现在比较 t_ruby
和 desired
:
awk 'FNR==NR{idx[[=12=]]=FNR; next}
{print [=12=], FNR, idx[[=12=]]}
' t_ruby desired | head -n 10
chr1_GL383520v2_alt 3718510 3948906 1 1
chr1_KI270708v1_random 9240458 9393655 2 2
chr1_KI270710v1_random 8843336 9304602 3 3
chr1_KI270711v1_random 6111532 6446935 4 4
chr1_KI270713v1_random 7260088 7358876 5 5
chr1_KI270762v1_alt 4196600 4684821 6 6
chr1_KI270763v1_alt 1518275 1527443 7 7
chr2_KI270769v1_alt 8706329 9147304 8 8
chr2_KI270773v1_alt 8604675 8922311 9 9
chr3 3097856 3436204 10 10
... line by line the same ...
当我发布这个问题时,这就是我要找的。我相信这个算法可以加速和缩短很多。它将字符串拆分为数字部分和字符串部分,并逐节进行比较。我相信它的行为与 gnu sort -v
相同,除了它如何处理 .
我喜欢其他答案,因为它们 运行 很快,但我认为这个函数应该适用于任意长和奇怪的字母数字组。
awk '
function ivsort(i1,v1,i2,v2){
s=sprintf("%c",0)
split(gensub("[0-9]+",s"&"s,"g",i1),f1,s)
split(gensub("[0-9]+",s"&"s,"g",i2),f2,s)
for (val=1;val<=length(f1);val++){
if (!(val in f2)){
return -1
}
if (f1[val]==f2[val]){
continue
}
if (f1[val]+0 != f1[val] && f2[val]+0 != f2[val]){
return (f1[val]>f2[val])?1:-1
}
else{
return (f1[val]+0)-(f2[val]+0)
}
}
if (length(f2) > length(f1))
return -1
else
return length(i1) - length(i2)
}
到 运行 将该函数添加到 awk 程序的顶部并将其设置为 PROCINFO["sorted_in"]
的值
PROCINFO["sorted_in"] = "vsort"
for (x in a){
print x
}
我使用的生物信息学文件在一行中包含字符和数字的组合,如下所示
编辑更新示例
chr1 1 100
chr10 1 500
chr2 33 52
chr5 11 66
chr22 99 1052
chr11 444 2141
chr2 555 1200
chr7 300 444
chr7 44 222
chr21 24 6023
chr16 224 5521
chr3 3 200
chrX 6234 79593
chr1 5291 5500
chrY 204 310
我想这样排序
chr1 1 100
chr1 5291 5500
chr2 33 52
chr2 555 1200
chr3 3 200
chr5 11 66
chr7 44 222
chr7 300 444
chr10 1 500
chr11 444 2141
chr16 224 5521
chr21 24 6023
chr22 99 1052
chrX 6234 79593
chrY 204 310
我正在使用受控数组横向对它们进行排序,因为我使用 for 循环扫描它们,但已建立的排序方法是数字、字符串或类型排序。我想要与 GNU sort -V
排序完全一样的东西。在下面的 link 中,我看到我可以构建一个用于排序的自定义函数,但我不确定如何对字母数字值执行此操作。
我想做这样的事情,分别对字母值和数字部分进行排序,然后想办法将它们组合起来,但我不确定具体该怎么做。
echo "d23FgE55" | awk '{alph=;num=;gsub("[0-9]+","-",alph);gsub("[A-z]+","-",num);print alph,num}'
d-FgE- -23-55
编辑
抱歉,其中一个答案让我想起了我 运行 有时遇到的一些更复杂的例子,这让我问了一些更接近 sort -V
行为的问题。
未排序示例
chr1_KI270708v1_random 9240458 9393655
chrUn_KI270589v1 5789405 6182867
chr19_GL383576v1_alt 4363702 4753141
chr8_KI270820v1_alt 1008865 1426444
chrUn_GL000220v1 7612088 7825236
chrUn_KI270591v1 9457975 9812609
chr19_GL949747v2_alt 1578276 2033118
chr13_KI270841v1_alt 8680841 9033557
chr17_KI270859v1_alt 3996864 4344945
chr20_GL383577v2_alt 3002112 3480396
chrUn_KI270322v1 1563181 1629375
chrUn_KI270419v1 5482364 5893900
chrUn_KI270310v1 4229845 4626802
chr17_KI270907v1_alt 1735306 2201566
chr9 2052238 2476827
chr1_KI270713v1_random 7260088 7358876
chr19_KI270891v1_alt 7485890 7719006
chr19_KI270917v1_alt 7816269 7864474
chrUn_KI270378v1 8848225 9158581
chr19_KI270933v1_alt 1600444 2095219
chr3_KI270937v1_alt 9902343 10113942
chr12_KI270904v1_alt 6924313 7067502
chrUn_GL000214v1 2468418 2728693
chr19_KI270884v1_alt 2760167 3027068
chrUn_KI270582v1 6390491 6398266
chr5_KI270897v1_alt 324409 548890
chr1_GL383520v2_alt 3718510 3948906
chr19_KI270888v1_alt 3809944 3824324
chr11_JH159137v1_alt 6090599 6480896
chr1_KI270762v1_alt 4196600 4684821
chr11_KI270831v1_alt 2112401 2490081
chr5_GL339449v2_alt 7220557 7718111
chr19_KI270889v1_alt 8709969 8823256
chr19_KI270931v1_alt 6356216 6811953
chr5_GL383530v1_alt 109389 128439
chr11_KI270927v1_alt 2119470 2348459
chr17_KI270730v1_random 404268 854145
chrUn_KI270387v1 7430161 7648856
chr10 2656835 2873499
chr19_GL383573v1_alt 7863497 7896279
chrUn_KI270741v1 1292450 1371680
chrUn_KI270335v1 7360266 7748998
chr9_GL383539v1_alt 3394859 3499461
chrX 4490524 4892623
chrUn_KI270311v1 963681 1069745
chr11_JH159136v1_alt 8171978 8319851
chr17_JH159148v1_alt 3708868 3984631
chrUn_KI270544v1 7025954 7392905
chr19_KI270888v1_alt 9683785 10166473
chrUn_KI270521v1 6924036 7001552
chr1_KI270710v1_random 8843336 9304602
chr19_GL949746v1_alt 8572018 8832793
chrUn_KI270322v1 9392841 9512920
chrUn_KI270366v1 3332191 3576201
chr13_KI270841v1_alt 9033828 9276044
chr19_GL949748v2_alt 3321575 3743545
chr16 3704062 4122526
chr17_GL383563v3_alt 7476487 7845527
chr8_KI270810v1_alt 8499569 8873953
chr22_KI270732v1_random 9187626 9418154
chr20_GL383577v2_alt 3031797 3421627
chr9_GL383541v1_alt 3771485 3927905
chr9_KI270720v1_random 8948742 9157627
chr16_KI270854v1_alt 7278604 7297845
chr1_KI270763v1_alt 1518275 1527443
chrUn_KI270517v1 3377374 3454859
chr15_KI270850v1_alt 6358541 6822565
chr22_KB663609v1_alt 4944645 4971886
chr7_KI270806v1_alt 9032201 9471652
chrUn_KI270438v1 8523562 8944980
chr17_KI270730v1_random 3544067 3796807
chr18 1856815 2144546
chr20_KI270869v1_alt 2269342 2353172
chr5_GL383530v1_alt 2139701 2285854
chr8_KI270819v1_alt 7048265 7503415
chr17_JH159148v1_alt 7040113 7042904
chr5_KI270793v1_alt 9363008 9816819
chr19_KI270931v1_alt 1794178 2143519
chr3_KI270778v1_alt 2228100 2549359
chr19_KI270932v1_alt 8320855 8486835
chr12_GL877876v1_alt 3736839 3820171
chr12_GL877876v1_alt 2805577 2974710
chr4_KI270789v1_alt 3309756 3669565
chr19_KI270917v1_alt 9443280 9678387
chr11_KI270831v1_alt 8603751 9028904
chrUn_KI270387v1 5189812 5439563
chrUn_KI270507v1 599827 666674
chr1_KI270711v1_random 6111532 6446935
chr2_KI270773v1_alt 8604675 8922311
chr16_KI270856v1_alt 5855373 6089898
chr3 3097856 3436204
chr13_KI270840v1_alt 3127654 3295739
chr15_KI270849v1_alt 8948327 9336376
chr18_KI270863v1_alt 3984764 4166006
chr19_KI270920v1_alt 5554949 5919294
chr12 6624798 7106478
chr4_GL383528v1_alt 2099770 2280753
chr2_KI270769v1_alt 8706329 9147304
chr8_KI270812v1_alt 6146462 6388606
排序输出
chr1_GL383520v2_alt 3718510 3948906
chr1_KI270708v1_random 9240458 9393655
chr1_KI270710v1_random 8843336 9304602
chr1_KI270711v1_random 6111532 6446935
chr1_KI270713v1_random 7260088 7358876
chr1_KI270762v1_alt 4196600 4684821
chr1_KI270763v1_alt 1518275 1527443
chr2_KI270769v1_alt 8706329 9147304
chr2_KI270773v1_alt 8604675 8922311
chr3 3097856 3436204
chr3_KI270778v1_alt 2228100 2549359
chr3_KI270937v1_alt 9902343 10113942
chr4_GL383528v1_alt 2099770 2280753
chr4_KI270789v1_alt 3309756 3669565
chr5_GL339449v2_alt 7220557 7718111
chr5_GL383530v1_alt 109389 128439
chr5_GL383530v1_alt 2139701 2285854
chr5_KI270793v1_alt 9363008 9816819
chr5_KI270897v1_alt 324409 548890
chr7_KI270806v1_alt 9032201 9471652
chr8_KI270810v1_alt 8499569 8873953
chr8_KI270812v1_alt 6146462 6388606
chr8_KI270819v1_alt 7048265 7503415
chr8_KI270820v1_alt 1008865 1426444
chr9 2052238 2476827
chr9_GL383539v1_alt 3394859 3499461
chr9_GL383541v1_alt 3771485 3927905
chr9_KI270720v1_random 8948742 9157627
chr10 2656835 2873499
chr11_JH159136v1_alt 8171978 8319851
chr11_JH159137v1_alt 6090599 6480896
chr11_KI270831v1_alt 2112401 2490081
chr11_KI270831v1_alt 8603751 9028904
chr11_KI270927v1_alt 2119470 2348459
chr12 6624798 7106478
chr12_GL877876v1_alt 2805577 2974710
chr12_GL877876v1_alt 3736839 3820171
chr12_KI270904v1_alt 6924313 7067502
chr13_KI270840v1_alt 3127654 3295739
chr13_KI270841v1_alt 8680841 9033557
chr13_KI270841v1_alt 9033828 9276044
chr15_KI270849v1_alt 8948327 9336376
chr15_KI270850v1_alt 6358541 6822565
chr16 3704062 4122526
chr16_KI270854v1_alt 7278604 7297845
chr16_KI270856v1_alt 5855373 6089898
chr17_GL383563v3_alt 7476487 7845527
chr17_JH159148v1_alt 3708868 3984631
chr17_JH159148v1_alt 7040113 7042904
chr17_KI270730v1_random 404268 854145
chr17_KI270730v1_random 3544067 3796807
chr17_KI270859v1_alt 3996864 4344945
chr17_KI270907v1_alt 1735306 2201566
chr18 1856815 2144546
chr18_KI270863v1_alt 3984764 4166006
chr19_GL383573v1_alt 7863497 7896279
chr19_GL383576v1_alt 4363702 4753141
chr19_GL949746v1_alt 8572018 8832793
chr19_GL949747v2_alt 1578276 2033118
chr19_GL949748v2_alt 3321575 3743545
chr19_KI270884v1_alt 2760167 3027068
chr19_KI270888v1_alt 3809944 3824324
chr19_KI270888v1_alt 9683785 10166473
chr19_KI270889v1_alt 8709969 8823256
chr19_KI270891v1_alt 7485890 7719006
chr19_KI270917v1_alt 7816269 7864474
chr19_KI270917v1_alt 9443280 9678387
chr19_KI270920v1_alt 5554949 5919294
chr19_KI270931v1_alt 1794178 2143519
chr19_KI270931v1_alt 6356216 6811953
chr19_KI270932v1_alt 8320855 8486835
chr19_KI270933v1_alt 1600444 2095219
chr20_GL383577v2_alt 3002112 3480396
chr20_GL383577v2_alt 3031797 3421627
chr20_KI270869v1_alt 2269342 2353172
chr22_KB663609v1_alt 4944645 4971886
chr22_KI270732v1_random 9187626 9418154
chrUn_GL000214v1 2468418 2728693
chrUn_GL000220v1 7612088 7825236
chrUn_KI270310v1 4229845 4626802
chrUn_KI270311v1 963681 1069745
chrUn_KI270322v1 1563181 1629375
chrUn_KI270322v1 9392841 9512920
chrUn_KI270335v1 7360266 7748998
chrUn_KI270366v1 3332191 3576201
chrUn_KI270378v1 8848225 9158581
chrUn_KI270387v1 5189812 5439563
chrUn_KI270387v1 7430161 7648856
chrUn_KI270419v1 5482364 5893900
chrUn_KI270438v1 8523562 8944980
chrUn_KI270507v1 599827 666674
chrUn_KI270517v1 3377374 3454859
chrUn_KI270521v1 6924036 7001552
chrUn_KI270544v1 7025954 7392905
chrUn_KI270582v1 6390491 6398266
chrUn_KI270589v1 5789405 6182867
chrUn_KI270591v1 9457975 9812609
chrUn_KI270741v1 1292450 1371680
chrX 4490524 4892623
https://www.gnu.org/software/gawk/manual/html_node/Controlling-Array-Traversal.html
一种 gnu-awk
方法,将输入拆分为数字和字符串数组,然后使用 2 种不同的排序类型:
awk '{
v = gensub(/^[^0-9]+/, "", "1", )
if (v ~ /[0-9]/)
narr[v,] = [=10=]
else
sarr[v,] = [=10=]
}
END {
PROCINFO["sorted_in"]="@ind_num_asc"
for (i in narr) print narr[i]
PROCINFO["sorted_in"]="@ind_str_asc"
for (i in sarr) print sarr[i]
}' file
输出:
chr1_GL383520v2_alt 3718510 3948906
chr1_KI270708v1_random 9240458 9393655
chr1_KI270710v1_random 8843336 9304602
chr1_KI270711v1_random 6111532 6446935
chr1_KI270713v1_random 7260088 7358876
chr1_KI270762v1_alt 4196600 4684821
chr1_KI270763v1_alt 1518275 1527443
chr2_KI270769v1_alt 8706329 9147304
chr2_KI270773v1_alt 8604675 8922311
chr3 3097856 3436204
chr3_KI270778v1_alt 2228100 2549359
chr3_KI270937v1_alt 9902343 10113942
chr4_GL383528v1_alt 2099770 2280753
chr4_KI270789v1_alt 3309756 3669565
chr5_GL339449v2_alt 7220557 7718111
chr5_GL383530v1_alt 109389 128439
chr5_GL383530v1_alt 2139701 2285854
chr5_KI270793v1_alt 9363008 9816819
chr5_KI270897v1_alt 324409 548890
chr7_KI270806v1_alt 9032201 9471652
chr8_KI270810v1_alt 8499569 8873953
chr8_KI270812v1_alt 6146462 6388606
chr8_KI270819v1_alt 7048265 7503415
chr8_KI270820v1_alt 1008865 1426444
chr9 2052238 2476827
chr9_GL383539v1_alt 3394859 3499461
chr9_GL383541v1_alt 3771485 3927905
chr9_KI270720v1_random 8948742 9157627
chr10 2656835 2873499
chr11_JH159136v1_alt 8171978 8319851
chr11_JH159137v1_alt 6090599 6480896
chr11_KI270831v1_alt 2112401 2490081
chr11_KI270831v1_alt 8603751 9028904
chr11_KI270927v1_alt 2119470 2348459
chr12 6624798 7106478
chr12_GL877876v1_alt 2805577 2974710
chr12_GL877876v1_alt 3736839 3820171
chr12_KI270904v1_alt 6924313 7067502
chr13_KI270840v1_alt 3127654 3295739
chr13_KI270841v1_alt 8680841 9033557
chr13_KI270841v1_alt 9033828 9276044
chr15_KI270849v1_alt 8948327 9336376
chr15_KI270850v1_alt 6358541 6822565
chr16 3704062 4122526
chr16_KI270854v1_alt 7278604 7297845
chr16_KI270856v1_alt 5855373 6089898
chr17_GL383563v3_alt 7476487 7845527
chr17_JH159148v1_alt 3708868 3984631
chr17_JH159148v1_alt 7040113 7042904
chr17_KI270730v1_random 3544067 3796807
chr17_KI270730v1_random 404268 854145
chr17_KI270859v1_alt 3996864 4344945
chr17_KI270907v1_alt 1735306 2201566
chr18 1856815 2144546
chr18_KI270863v1_alt 3984764 4166006
chr19_GL383573v1_alt 7863497 7896279
chr19_GL383576v1_alt 4363702 4753141
chr19_GL949746v1_alt 8572018 8832793
chr19_GL949747v2_alt 1578276 2033118
chr19_GL949748v2_alt 3321575 3743545
chr19_KI270884v1_alt 2760167 3027068
chr19_KI270888v1_alt 3809944 3824324
chr19_KI270888v1_alt 9683785 10166473
chr19_KI270889v1_alt 8709969 8823256
chr19_KI270891v1_alt 7485890 7719006
chr19_KI270917v1_alt 7816269 7864474
chr19_KI270917v1_alt 9443280 9678387
chr19_KI270920v1_alt 5554949 5919294
chr19_KI270931v1_alt 1794178 2143519
chr19_KI270931v1_alt 6356216 6811953
chr19_KI270932v1_alt 8320855 8486835
chr19_KI270933v1_alt 1600444 2095219
chr20_GL383577v2_alt 3002112 3480396
chr20_GL383577v2_alt 3031797 3421627
chr20_KI270869v1_alt 2269342 2353172
chr22_KB663609v1_alt 4944645 4971886
chr22_KI270732v1_random 9187626 9418154
chrUn_GL000214v1 2468418 2728693
chrUn_GL000220v1 7612088 7825236
chrUn_KI270310v1 4229845 4626802
chrUn_KI270311v1 963681 1069745
chrUn_KI270322v1 1563181 1629375
chrUn_KI270322v1 9392841 9512920
chrUn_KI270335v1 7360266 7748998
chrUn_KI270366v1 3332191 3576201
chrUn_KI270378v1 8848225 9158581
chrUn_KI270387v1 5189812 5439563
chrUn_KI270387v1 7430161 7648856
chrUn_KI270419v1 5482364 5893900
chrUn_KI270438v1 8523562 8944980
chrUn_KI270507v1 599827 666674
chrUn_KI270517v1 3377374 3454859
chrUn_KI270521v1 6924036 7001552
chrUn_KI270544v1 7025954 7392905
chrUn_KI270582v1 6390491 6398266
chrUn_KI270589v1 5789405 6182867
chrUn_KI270591v1 9457975 9812609
chrUn_KI270741v1 1292450 1371680
chrX 4490524 4892623
如果您不介意在数字排序中字母排在数字之前(或者可以找到一个语言环境,但情况并非如此),这可能就是您所需要的:
$ cat tst.awk
{ vals[gensub(/chr/,"",1,)][][] }
END {
OFS = "\t"
PROCINFO["sorted_in"] = "@ind_num_asc"
for (chr in vals) {
for (beg in vals[chr]) {
for (end in vals[chr][beg]) {
print "chr"chr, beg, end
}
}
}
}
$ awk -f tst.awk file
chrX 6234 79593
chrY 204 310
chr1 1 100
chr1 5291 5500
chr2 33 52
chr2 555 1200
chr3 3 200
chr5 11 66
chr7 44 222
chr7 300 444
chr10 1 500
chr11 444 2141
chr16 224 5521
chr21 24 6023
chr22 99 1052
以上使用 GNU awk 作为数组的数组,sorted_in。
假设:
- 第一个字段将始终至少以一个字母开头,后跟 0 个或多个数字(即
[:alpha:]+[:digit:]*
) - 字母和数字永远不会混合(例如,
d23FgE55
不会出现) - 第一个字段仅由字母和数字组成(即
[:alnum:]
) - 整个输入文件适合约 60% 的可用内存(否则将整个文件加载到
awk
数组可能会出现 'out of memory' 错误)
向输入添加更多数据点:
$ cat dat.raw
chr1 1 100
chr10 1 500
chr2 33 52
bec9 3 17
chr5 11 66
chr22 99 1052
chr11 444 2141
chr2 555 1200
chr7 300 444
defA 3 15
def7 13 15
def7 3 15
chr7 44 222
chr21 24 6023
chr16 224 5521
chr3 3 200
chrX 6234 79593
chr1 5291 5500
chrY 204 310
对数组的数组使用 GNU awk
和 PROCINFO["sorted_in"]
:
awk '
{ keyL=gensub(/([[:digit:]])/,"","g",) # strip numbers from 1st field
keyN=gensub(/([[:alpha:]])/,"","g",) # strip letters from 1st field
if ( keyN == "" ) keyN=999999999 # if no numbers in 1st field use a really big number to insure these rows go to the end of the list
arr[keyL][keyN][]=[=11=] # arr[] index = 1st field letters / 1st field numbers / 2nd field
}
END { PROCINFO["sorted_in"]="@ind_str_asc" # sort 1st index as string
for (i in arr) {
PROCINFO["sorted_in"]="@ind_num_asc" # sort 2nd/3rd indices as numbers
for (j in arr[i])
for (k in arr[i][j])
print arr[i][j][k]
}
}
' dat.raw
这会生成:
bec9 3 17
chr1 1 100
chr1 5291 5500
chr2 33 52
chr2 555 1200
chr3 3 200
chr5 11 66
chr7 44 222
chr7 300 444
chr10 1 500
chr11 444 2141
chr16 224 5521
chr21 24 6023
chr22 99 1052
chrX 6234 79593
chrY 204 310
def7 3 15
def7 13 15
defA 3 15
GNU 实用程序的版本排序 rules 完全实现起来并不容易。
鉴于你的第二个例子 unsorted
和 desired
:
head -n 5 unsorted
chr1_KI270708v1_random 9240458 9393655
chrUn_KI270589v1 5789405 6182867
chr19_GL383576v1_alt 4363702 4753141
chr8_KI270820v1_alt 1008865 1426444
chrUn_GL000220v1 7612088 7825236
...
head -n 5 desired
chr1_GL383520v2_alt 3718510 3948906
chr1_KI270708v1_random 9240458 9393655
chr1_KI270710v1_random 8843336 9304602
chr1_KI270711v1_random 6111532 6446935
chr1_KI270713v1_random 7260088 7358876
...
这里是ruby。它完全适用于您的第二个示例:
ruby -e 'ar=$<.read.split(/\r?\n/);
puts ar.sort_by {|e| e.split(/(\d+)/).map {|a| a =~ /\d+/ ? a.to_i : a }}.
join("\n")' unsorted >t_ruby
现在比较 t_ruby
和 desired
:
awk 'FNR==NR{idx[[=12=]]=FNR; next}
{print [=12=], FNR, idx[[=12=]]}
' t_ruby desired | head -n 10
chr1_GL383520v2_alt 3718510 3948906 1 1
chr1_KI270708v1_random 9240458 9393655 2 2
chr1_KI270710v1_random 8843336 9304602 3 3
chr1_KI270711v1_random 6111532 6446935 4 4
chr1_KI270713v1_random 7260088 7358876 5 5
chr1_KI270762v1_alt 4196600 4684821 6 6
chr1_KI270763v1_alt 1518275 1527443 7 7
chr2_KI270769v1_alt 8706329 9147304 8 8
chr2_KI270773v1_alt 8604675 8922311 9 9
chr3 3097856 3436204 10 10
... line by line the same ...
当我发布这个问题时,这就是我要找的。我相信这个算法可以加速和缩短很多。它将字符串拆分为数字部分和字符串部分,并逐节进行比较。我相信它的行为与 gnu sort -v
相同,除了它如何处理 .
我喜欢其他答案,因为它们 运行 很快,但我认为这个函数应该适用于任意长和奇怪的字母数字组。
awk '
function ivsort(i1,v1,i2,v2){
s=sprintf("%c",0)
split(gensub("[0-9]+",s"&"s,"g",i1),f1,s)
split(gensub("[0-9]+",s"&"s,"g",i2),f2,s)
for (val=1;val<=length(f1);val++){
if (!(val in f2)){
return -1
}
if (f1[val]==f2[val]){
continue
}
if (f1[val]+0 != f1[val] && f2[val]+0 != f2[val]){
return (f1[val]>f2[val])?1:-1
}
else{
return (f1[val]+0)-(f2[val]+0)
}
}
if (length(f2) > length(f1))
return -1
else
return length(i1) - length(i2)
}
到 运行 将该函数添加到 awk 程序的顶部并将其设置为 PROCINFO["sorted_in"]
PROCINFO["sorted_in"] = "vsort"
for (x in a){
print x
}