Sorting specific columns randomly
I have a file that looks like this:
1 90042011 90042031 AGTGCCACCAGGTGGGGCCG 90042029 2 5008 G:5006 T:2 90184334 90184354 GAGGCCACAAGAGGGCACAA 90184337 2 5008 C:5007 T:1
1 94853396 94853416 GCCGCCACTAGATGGTGCTA 94853398 2 5008 T:5007 C:1 94969254 94969274 CAGTCCTCTAGAGGGAGCCC 94969266 2 5008 G:5006 A:2
2 103584283 103584303 TGGGCCACTAGGAGGCACTG 103584300 2 5008 C:5006 T:2 103841436 103841456 CATGCCACAAGAGGGCATCA 103841456 2 5008 A:5006 G:2
3 67156478 67156498 TGGACCACCAGGTGGCAGTA 67156492 2 5008 A:3308 T:1700 67316925 67316945 CTGACCACCAGAGGGCAACA 67316942 2 5008 C:5004 T:4
4 106206208 106206228 AAGGCCAGAAGAGGGCATCA 106206210 2 5008 C:5007 T:1 106381214 106381234 CGTGCCAGCAGAGGGCGGTG 106381217 2 5008 A:5007 G:1
4 106511652 106511672 GCGGCCTCTAGGGGGCACTG 106511672 2 5008 C:5000 T:8 106836794 106836814 TCCACCAGTAGAGGGTACTA 106836796 2 5008 G:5007 T:1
4 107410713 107410733 GTGGCCAGCAGGGGGCACTC 107410725 2 5008 T:5007 G:1 107517764 107517784 CTCTCCACACGGGGGCGCCC 107517769 2 5008 G:654 GAGAGGTATTCCCT:4354
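I would like to write a script that treats the values in columns 10 to 17 as a single row and shuffles those rows randomly, while leaving the values in columns 1 to 9 as they are. The output would look something like this: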
1 90042011 90042031 AGTGCCACCAGGTGGGGCCG 90042029 2 5008 G:5006 T:2 107517764 107517784 CTCTCCACACGGGGGCGCCC 107517769 2 5008 G:654 GAGAGGTATTCCCT:4354
1 94853396 94853416 GCCGCCACTAGATGGTGCTA 94853398 2 5008 T:5007 C:1 103841436 103841456 CATGCCACAAGAGGGCATCA 103841456 2 5008 A:5006 G:2
2 103584283 103584303 TGGGCCACTAGGAGGCACTG 103584300 2 5008 C:5006 T:2 106381214 106381234 CGTGCCAGCAGAGGGCGGTG 106381217 2 5008 A:5007 G:1
3 67156478 67156498 TGGACCACCAGGTGGCAGTA 67156492 2 5008 A:3308 T:1700 90184334 90184354 GAGGCCACAAGAGGGCACAA 90184337 2 5008 C:5007 T:1
4 106206208 106206228 AAGGCCAGAAGAGGGCATCA 106206210 2 5008 C:5007 T:1 94969254 94969274 CAGTCCTCTAGAGGGAGCCC 94969266 2 5008 G:5006 A:2
4 106511652 106511672 GCGGCCTCTAGGGGGCACTG 106511672 2 5008 C:5000 T:8 67316925 67316945 CTGACCACCAGAGGGCAACA 67316942 2 5008 C:5004 T:4
4 107410713 107410733 GTGGCCAGCAGGGGGCACTC 107410725 2 5008 T:5007 G:1 106836794 106836814 TCCACCAGTAGAGGGTACTA 106836796 2 5008 G:5007 T:1
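If the columns are tab-separated, you can use cut to extract the columns you want to keep and the columns you want to shuffle, then use paste to glue them back together: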
paste <(cut -f1-9 file) <(cut -f10- file | shuf )
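Both cut and paste use a tab as their default delimiter, so the one-liner above only works on tab-separated input. If the file is separated by single spaces instead (an assumption; the sample as pasted does not show which), a minimal sketch of the same idea could look like this. The function name `shuffle_tail` is hypothetical, and a temporary file is used in place of process substitution so that it also runs in a plain POSIX shell:

```shell
# shuffle_tail FILE
# Hypothetical helper: keeps columns 1-9 of each row in place and shuffles
# columns 10 onward as a unit across rows.
# Assumes columns are separated by exactly one space.
shuffle_tail() {
    tmp=$(mktemp) || return 1
    # columns 10.. of every row, in random row order
    cut -d' ' -f10- "$1" | shuf > "$tmp"
    # columns 1-9 in original order, glued to the shuffled remainder
    cut -d' ' -f1-9 "$1" | paste -d' ' - "$tmp"
    rm -f "$tmp"
}
```

Calling `shuffle_tail file` prints the result to standard output. If the columns are separated by runs of several spaces, cut will see empty fields, and an awk-based approach is likely the safer choice.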
The shuffle functions are borrowed from [here]; the script below is stored in a file:
$ cat shuffler.awk
# actual shuffle function
# shuffles the values in "array" in-place, from indices "left" to "right".
# required for all of the shuf() functions below
function __shuffle(array, left, right, r, i, tmp) {
    # loop backwards over the elements
    for (i=right; i>left; i--) {
        # generate a random number between the start and current element
        r = int(rand() * (i - left + 1)) + left;
        # swap current element and randomly generated one
        tmp = array[i];
        array[i] = array[r];
        array[r] = tmp;
    }
}

## usage: shuf(s, d)
## shuffles the array "s", creating a new shuffled array "d" indexed with
## sequential integers starting with one. returns the length, or -1 if an error
## occurs. leaves the indices of the source array "s" unchanged. uses the knuth-
## fisher-yates algorithm. requires the __shuffle() function.
function shuf(array, out, count, i) {
    # loop over each index, and generate a new array with the same values and
    # sequential indices
    count = 0;
    for (i in array) {
        out[++count] = array[i];
    }
    # seed the random number generator
    srand();
    # actually shuffle
    __shuffle(out, 1, count);
    # return the length
    return count;
}

NR==FNR{
    for(i=1;i<=9;i++){fp[NR]=fp[NR] OFS $i}
    for(i=10;i<=17;i++){sp[NR]=sp[NR] OFS $i}
}
END{
    shuf(sp,spnew)
    for(i=1;i<=NR;i++)
        print fp[i] spnew[i]
}
Run the script as follows:
$ awk -f shuffler.awk casefile_48424159 | column -t
1 90042011 90042031 AGTGCCACCAGGTGGGGCCG 90042029 2 5008 G:5006 T:2 94969254 94969274 CAGTCCTCTAGAGGGAGCCC 94969266 2 5008 G:5006 A:2
1 94853396 94853416 GCCGCCACTAGATGGTGCTA 94853398 2 5008 T:5007 C:1 103841436 103841456 CATGCCACAAGAGGGCATCA 103841456 2 5008 A:5006 G:2
2 103584283 103584303 TGGGCCACTAGGAGGCACTG 103584300 2 5008 C:5006 T:2 90184334 90184354 GAGGCCACAAGAGGGCACAA 90184337 2 5008 C:5007 T:1
3 67156478 67156498 TGGACCACCAGGTGGCAGTA 67156492 2 5008 A:3308 T:1700 67316925 67316945 CTGACCACCAGAGGGCAACA 67316942 2 5008 C:5004 T:4
4 106206208 106206228 AAGGCCAGAAGAGGGCATCA 106206210 2 5008 C:5007 T:1 106381214 106381234 CGTGCCAGCAGAGGGCGGTG 106381217 2 5008 A:5007 G:1
4 106511652 106511672 GCGGCCTCTAGGGGGCACTG 106511672 2 5008 C:5000 T:8 107517764 107517784 CTCTCCACACGGGGGCGCCC 107517769 2 5008 G:654 GAGAGGTATTCCCT:4354
4 107410713 107410733 GTGGCCAGCAGGGGGCACTC 107410725 2 5008 T:5007 G:1 106836794 106836814 TCCACCAGTAGAGGGTACTA 106836796 2 5008 G:5007 T:1
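Because the script seeds the generator with srand() and no argument, the result depends on the time it is run. When a repeatable shuffle is useful, for example while testing, the same Fisher-Yates idea can be written more compactly with an explicit seed. This is only a sketch: the wrapper name `shuffle_cols` and its optional seed argument are hypothetical, and it assumes every row has at least 10 whitespace-separated fields.

```shell
# shuffle_cols FILE [SEED]
# Hypothetical wrapper: keeps fields 1-9 of each row in place and shuffles
# fields 10..NF as a unit across rows. Output fields are joined by one space.
shuffle_cols() {
    awk -v seed="${2:-0}" '
    BEGIN { srand(seed) }   # explicit seed: same seed, same order (per awk implementation)
    {
        keep[NR] = $1
        for (i = 2; i <= 9; i++) keep[NR] = keep[NR] OFS $i
        mix[NR] = $10
        for (i = 11; i <= NF; i++) mix[NR] = mix[NR] OFS $i
    }
    END {
        # Fisher-Yates: walk backwards, swapping each slot with a random slot at or before it
        for (i = NR; i > 1; i--) {
            r = int(rand() * i) + 1
            tmp = mix[i]; mix[i] = mix[r]; mix[r] = tmp
        }
        for (i = 1; i <= NR; i++) print keep[i], mix[i]
    }' "$1"
}
```

For example, `shuffle_cols casefile_48424159 42 | column -t` should give the same row order on every run with a given awk, while a different seed gives a different permutation.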