Sorting specific columns randomly

I have a file that looks like this:

1   90042011    90042031    AGTGCCACCAGGTGGGGCCG    90042029    2   5008    G:5006  T:2     90184334    90184354    GAGGCCACAAGAGGGCACAA    90184337    2   5008    C:5007  T:1
1   94853396    94853416    GCCGCCACTAGATGGTGCTA    94853398    2   5008    T:5007  C:1     94969254    94969274    CAGTCCTCTAGAGGGAGCCC    94969266    2   5008    G:5006  A:2
2   103584283   103584303   TGGGCCACTAGGAGGCACTG    103584300   2   5008    C:5006  T:2     103841436   103841456   CATGCCACAAGAGGGCATCA    103841456   2   5008    A:5006  G:2
3   67156478    67156498    TGGACCACCAGGTGGCAGTA    67156492    2   5008    A:3308  T:1700      67316925    67316945    CTGACCACCAGAGGGCAACA    67316942    2   5008    C:5004  T:4
4   106206208   106206228   AAGGCCAGAAGAGGGCATCA    106206210   2   5008    C:5007  T:1     106381214   106381234   CGTGCCAGCAGAGGGCGGTG    106381217   2   5008    A:5007  G:1
4   106511652   106511672   GCGGCCTCTAGGGGGCACTG    106511672   2   5008    C:5000  T:8     106836794   106836814   TCCACCAGTAGAGGGTACTA    106836796   2   5008    G:5007  T:1
4   107410713   107410733   GTGGCCAGCAGGGGGCACTC    107410725   2   5008    T:5007  G:1     107517764   107517784   CTCTCCACACGGGGGCGCCC    107517769   2   5008    G:654   GAGAGGTATTCCCT:4354

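I want to write a script that treats the values in columns 10 to 17 as a single row and shuffles those rows randomly, while leaving the values in columns 1 to 9 as they are. The output would then look something like this: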

1   90042011    90042031    AGTGCCACCAGGTGGGGCCG    90042029    2   5008    G:5006  T:2     107517764   107517784   CTCTCCACACGGGGGCGCCC    107517769   2   5008    G:654   GAGAGGTATTCCCT:4354
1   94853396    94853416    GCCGCCACTAGATGGTGCTA    94853398    2   5008    T:5007  C:1     103841436   103841456   CATGCCACAAGAGGGCATCA    103841456   2   5008    A:5006  G:2
2   103584283   103584303   TGGGCCACTAGGAGGCACTG    103584300   2   5008    C:5006  T:2     106381214   106381234   CGTGCCAGCAGAGGGCGGTG    106381217   2   5008    A:5007  G:1
3   67156478    67156498    TGGACCACCAGGTGGCAGTA    67156492    2   5008    A:3308  T:1700      90184334    90184354    GAGGCCACAAGAGGGCACAA    90184337    2   5008    C:5007  T:1
4   106206208   106206228   AAGGCCAGAAGAGGGCATCA    106206210   2   5008    C:5007  T:1     94969254    94969274    CAGTCCTCTAGAGGGAGCCC    94969266    2   5008    G:5006  A:2
4   106511652   106511672   GCGGCCTCTAGGGGGCACTG    106511672   2   5008    C:5000  T:8     67316925    67316945    CTGACCACCAGAGGGCAACA    67316942    2   5008    C:5004  T:4
4   107410713   107410733   GTGGCCAGCAGGGGGCACTC    107410725   2   5008    T:5007  G:1     106836794   106836814   TCCACCAGTAGAGGGTACTA    106836796   2   5008    G:5007  T:1

If the columns are tab-separated, you can use cut to extract the columns to keep and the columns to shuffle, then glue them back together with paste:

paste <(cut -f1-9 file) <(cut -f10- file | shuf )
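
The one-liner above relies on process substitution (`<(...)`), which is a bash feature and is not available in plain POSIX sh. As a rough sketch of the same idea without it, the helper below writes the shuffled half to a temporary file first. The function name `shuffle_tail` and its `first` parameter are made up for illustration, and it assumes tab-separated input, `first` of at least 2, and GNU `shuf` on the PATH.

```shell
#!/bin/sh
# Hypothetical helper (not from the original answer): keep columns
# 1..(first-1) in place and shuffle columns first..last as whole rows.
# Assumes tab-separated input, first >= 2, and GNU shuf on the PATH.
shuffle_tail() {
    file=$1
    first=$2    # first column of the block to shuffle (10 in the question)
    tmp=$(mktemp) || return 1
    # shuffled right-hand part goes to a temporary file
    cut -f"$first"- "$file" | shuf > "$tmp"
    # untouched left-hand part is pasted next to it, line by line
    cut -f1-"$((first - 1))" "$file" | paste - "$tmp"
    rm -f "$tmp"
}

# Tiny demo: column 1 stays put, columns 2-3 move together as one unit.
sample=$(mktemp)
printf 'a\t1\tx\nb\t2\ty\nc\t3\tz\n' > "$sample"
shuffle_tail "$sample" 2
```

For the file in the question, the equivalent call would be `shuffle_tail file 10`.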

Using a shuffle function borrowed from [here], the following script is stored in a file:

$ cat shuffler.awk
# actual shuffle function
# shuffles the values in "array" in-place, from indices "left" to "right".
# required for all of the shuf() functions below

function __shuffle(array, left, right,    r, i, tmp) {
  # loop backwards over the elements
  for (i=right; i>left; i--) {
    # generate a random number between the start and current element
    r = int(rand() * (i - left + 1)) + left;

    # swap current element and randomly generated one
    tmp = array[i];
    array[i] = array[r];
    array[r] = tmp;
  }
}
## usage: shuf(s, d)
## shuffles the array "s", creating a new shuffled array "d" indexed with
## sequential integers starting with one. returns the length of "d". leaves
## the source array "s" unchanged. uses the knuth-fisher-yates algorithm.
## requires the __shuffle() function.
function shuf(array, out,    count, i) {
  # loop over each index, and generate a new array with the same values and
  # sequential indices
  count = 0;
  for (i in array) {
    out[++count] = array[i];
  }

  # seed the random number generator
  srand();

  # actually shuffle
  __shuffle(out, 1, count);

  # return the length
  return count;
}

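# main rule: split each input line into the part that stays in place
# (columns 1-9) and the part that gets shuffled (columns 10-17)
{
  for (i = 1; i <= 9; i++)   fp[NR] = (i == 1 ? $i : fp[NR] OFS $i)
  for (i = 10; i <= 17; i++) sp[NR] = sp[NR] OFS $i
}
END {
  # shuffle the right-hand parts, then print each fixed left-hand part
  # next to its new partner (sp entries already start with OFS)
  shuf(sp, spnew)
  for (i = 1; i <= NR; i++)
    print fp[i] spnew[i]
}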

Run the script as follows:

$ awk -f shuffler.awk casefile_48424159 | column -t
1  90042011   90042031   AGTGCCACCAGGTGGGGCCG  90042029   2  5008  G:5006  T:2     94969254   94969274   CAGTCCTCTAGAGGGAGCCC  94969266   2  5008  G:5006  A:2
1  94853396   94853416   GCCGCCACTAGATGGTGCTA  94853398   2  5008  T:5007  C:1     103841436  103841456  CATGCCACAAGAGGGCATCA  103841456  2  5008  A:5006  G:2
2  103584283  103584303  TGGGCCACTAGGAGGCACTG  103584300  2  5008  C:5006  T:2     90184334   90184354   GAGGCCACAAGAGGGCACAA  90184337   2  5008  C:5007  T:1
3  67156478   67156498   TGGACCACCAGGTGGCAGTA  67156492   2  5008  A:3308  T:1700  67316925   67316945   CTGACCACCAGAGGGCAACA  67316942   2  5008  C:5004  T:4
4  106206208  106206228  AAGGCCAGAAGAGGGCATCA  106206210  2  5008  C:5007  T:1     106381214  106381234  CGTGCCAGCAGAGGGCGGTG  106381217  2  5008  A:5007  G:1
4  106511652  106511672  GCGGCCTCTAGGGGGCACTG  106511672  2  5008  C:5000  T:8     107517764  107517784  CTCTCCACACGGGGGCGCCC  107517769  2  5008  G:654   GAGAGGTATTCCCT:4354
4  107410713  107410733  GTGGCCAGCAGGGGGCACTC  107410725  2  5008  T:5007  G:1     106836794  106836814  TCCACCAGTAGAGGGTACTA  106836796  2  5008  G:5007  T:1
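
The awk script above calls srand() with no argument, which in POSIX awk seeds the generator from the time of day, so a given ordering cannot be reproduced on demand. As a sketch only, the variant below takes an explicit seed and inlines the Fisher-Yates loop. The function name `shuffle_cols_awk` and the awk variables `first` and `seed` are illustrative and not part of the original answer, and it assumes tab-separated input with `first` of at least 2.

```shell
#!/bin/sh
# Hypothetical helper: shuffle_cols_awk FILE FIRST SEED
# Keeps columns 1..(FIRST-1) in place and shuffles columns FIRST..NF as
# whole rows, using SEED so that the same call gives the same order.
# Assumes tab-separated input and FIRST >= 2.
shuffle_cols_awk() {
    awk -F'\t' -v OFS='\t' -v first="$2" -v seed="$3" '
    {
        # left part: columns 1..first-1, right part: columns first..NF
        left[NR] = $1
        for (i = 2; i < first; i++) left[NR] = left[NR] OFS $i
        right[NR] = $first
        for (i = first + 1; i <= NF; i++) right[NR] = right[NR] OFS $i
    }
    END {
        srand(seed)
        # Fisher-Yates: walk backwards, swapping row i with a random row in 1..i
        for (i = NR; i > 1; i--) {
            r = int(rand() * i) + 1
            tmp = right[i]; right[i] = right[r]; right[r] = tmp
        }
        for (i = 1; i <= NR; i++) print left[i], right[i]
    }' "$1"
}

# Tiny demo on a three-column sample; column 1 stays in its original order.
sample=$(mktemp)
printf 'a\t1\tx\nb\t2\ty\nc\t3\tz\nd\t4\tw\n' > "$sample"
shuffle_cols_awk "$sample" 2 42
```

For the file in the question the call would be `shuffle_cols_awk casefile_48424159 10 42`, again assuming the columns are separated by tabs. Changing the seed changes the ordering.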