Add column at beginning of file using awk

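I have a very simple one-liner that works almost perfectly. I want to add a new column to a file, based on conditions in columns 12 and 3, that reads "non-coding" or "coding": if column 12 contains the substring RNA or mir-, and/or column 3 == "pseudogene", then column 1 should read non-coding, otherwise coding.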

#file

X   FlyBase pseudogene  19910168    19910521    .   -   .   gene_id "FBgn0052821"; gene_symbol "CR32821"; transcript_id "FBtr0307588"; transcript_symbol "CR32821-RB";
X   FlyBase pseudogene  476857  479309  .   -   .   gene_id "FBgn0029523"; gene_symbol "CR18275"; transcript_id "FBtr0070097"; transcript_symbol "CR18275-RA";
X   FlyBase pseudogene  5832298 5832368 .   +   .   gene_id "FBgn0052761"; gene_symbol "tRNA:Glu-CTC-6-1Psi"; transcript_id "FBtr0070818"; transcript_symbol "tRNA:Glu-CTC-6-1Psi-RA";
X   FlyBase pseudogene  6361496 6362960 .   -   .   gene_id "FBgn0016974"; gene_symbol "swaPsi"; transcript_id "FBtr0070923"; transcript_symbol "swaPsi-RA";
X   FlyBase pseudogene  6361496 6363310 .   -   .   gene_id "FBgn0016974"; gene_symbol "swaPsi"; transcript_id "FBtr0334014"; transcript_symbol "swaPsi-RB";
X   FlyBase gene    20025099    20025170    .   +   .   gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
X   FlyBase gene    1482492 1482590 .   -   .   gene_id "FBgn0044508"; gene_symbol "snoRNA:M";
X   FlyBase gene    2330159 2330826 .   +   .   gene_id "FBgn0053218"; gene_symbol "lncRNA:CR33218";
X   FlyBase gene    3427452 3427523 .   -   .   gene_id "FBgn0052493"; gene_symbol "tRNA:Gln-TTG-2-1";
X   FlyBase gene    3819699 3819770 .   +   .   gene_id "FBgn0052785"; gene_symbol "tRNA:Gln-CTG-2-1";
X   FlyBase gene    3827622 3827693 .   +   .   gene_id "FBgn0025118"; gene_symbol "tRNA:Pro-CGG-3-1";
2L  FlyBase gene    825969  833241  .   +   .   gene_id "FBgn0010583"; gene_symbol "dock";
2L  FlyBase gene    852768  854539  .   +   .   gene_id "FBgn0020545"; gene_symbol "kraken";
2L  FlyBase gene    855337  856639  .   +   .   gene_id "FBgn0031288"; gene_symbol "CG13949";
2L  FlyBase gene    860197  861806  .   +   .   gene_id "FBgn0031289"; gene_symbol "CG13950";
2L  FlyBase gene    877302  878270  .   +   .   gene_id "FBgn0002936"; gene_symbol "ninaA";

#command 

awk '{ if($12 ~ /RNA/ || $12 ~ /mir-/ || $3 == "pseudogene") $1="non-coding"; else $1="coding"; print $0 }'  a.gene-pseudogene_all_dmel-all-r6.40.gtf

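The code works, but it replaces column 1. That is not what I want: I want to add this new column before column 1, so that it becomes the new column 1.

How can I adjust it?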

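You can (effectively) add a new column by turning `$1` into `something OFS $1`. You are not really creating a new column (`$2` still refers to the original second column, and `$1` now holds "two" columns), but that does not matter in this case: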

awk '{
  x = ( $12~/RNA|mir-/ || $3=="pseudogene" ) ? "non-" : ""
  $1 = x "coding" OFS $1
  print
}' a.gene-pseudogene_all_dmel-all-r6.40.gtf
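To see what "not really a new column" means in practice, here is a small sketch on a made-up three-field line (not taken from the GTF file; the function name `show_fields` is only for illustration). After the assignment, `NF` is unchanged and `$2` is still the original second field. Assigning `$0` to itself makes awk re-split the rebuilt record, and only then is the new column counted.

```shell
# show_fields: hypothetical helper that prints NF and $2 before and after re-splitting.
show_fields() {
  printf 'X FlyBase gene\n' | awk '{
    $1 = "coding" OFS $1   # $1 is now "coding X"; $0 is rebuilt using OFS
    print NF, $2           # prints: 3 FlyBase
    $0 = $0                # force awk to re-split the rebuilt record
    print NF, $2           # prints: 4 X
  }'
}

show_fields
```

Note that assigning to any field rebuilds `$0` with `OFS` between fields, so with the default settings runs of tabs or spaces in the original line become single spaces.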

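The above technique can be used to insert before (or after) any column. Because we are adding before the first column (or if we were appending after the last column), the code can be made more efficient by avoiding the assignment: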

awk '{
  x = ( $12~/RNA|mir-/ || $3=="pseudogene" ) ? "non-" : ""
  print x "coding", $0
}' a.gene-pseudogene_all_dmel-all-r6.40.gtf
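For a quick check without the full GTF file, the same logic can be run on a few lines copied from the sample above. This is only a sketch: the `classify` wrapper function is a name introduced here for illustration, and the input is fed through a here-document instead of a file.

```shell
# classify: hypothetical wrapper that prepends "coding" or "non-coding" to each stdin line.
classify() {
  awk '{
    x = ( $12 ~ /RNA|mir-/ || $3 == "pseudogene" ) ? "non-" : ""
    print x "coding", $0   # $0 is never modified, so its original spacing is kept
  }'
}

# Three lines from the question: a pseudogene, a snoRNA gene, and a protein-coding gene.
classify <<'EOF'
X FlyBase pseudogene 476857 479309 . - . gene_id "FBgn0029523"; gene_symbol "CR18275"; transcript_id "FBtr0070097"; transcript_symbol "CR18275-RA";
X FlyBase gene 1482492 1482590 . - . gene_id "FBgn0044508"; gene_symbol "snoRNA:M";
2L FlyBase gene 825969 833241 . + . gene_id "FBgn0010583"; gene_symbol "dock";
EOF
```

The first two lines should come out prefixed with non-coding and the third with coding. The new column is separated from the rest of the line by `OFS`, a single space by default; if the file is tab-delimited and a tab is wanted there, passing `-v OFS='\t'` to awk should do it.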