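Add column at beginning of file using awk
I have a very simple one-liner that works almost perfectly. I want to add a new column to a file, based on conditions on columns 12 and 3, that reads "non-coding" or "coding": if column 12 contains the substring RNA or mir-, and/or column 3 == "pseudogene", then column 1 should read non-coding, otherwise coding.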
#file
X FlyBase pseudogene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821"; transcript_id "FBtr0307588"; transcript_symbol "CR32821-RB";
X FlyBase pseudogene 476857 479309 . - . gene_id "FBgn0029523"; gene_symbol "CR18275"; transcript_id "FBtr0070097"; transcript_symbol "CR18275-RA";
X FlyBase pseudogene 5832298 5832368 . + . gene_id "FBgn0052761"; gene_symbol "tRNA:Glu-CTC-6-1Psi"; transcript_id "FBtr0070818"; transcript_symbol "tRNA:Glu-CTC-6-1Psi-RA";
X FlyBase pseudogene 6361496 6362960 . - . gene_id "FBgn0016974"; gene_symbol "swaPsi"; transcript_id "FBtr0070923"; transcript_symbol "swaPsi-RA";
X FlyBase pseudogene 6361496 6363310 . - . gene_id "FBgn0016974"; gene_symbol "swaPsi"; transcript_id "FBtr0334014"; transcript_symbol "swaPsi-RB";
X FlyBase gene 20025099 20025170 . + . gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
X FlyBase gene 1482492 1482590 . - . gene_id "FBgn0044508"; gene_symbol "snoRNA:M";
X FlyBase gene 2330159 2330826 . + . gene_id "FBgn0053218"; gene_symbol "lncRNA:CR33218";
X FlyBase gene 3427452 3427523 . - . gene_id "FBgn0052493"; gene_symbol "tRNA:Gln-TTG-2-1";
X FlyBase gene 3819699 3819770 . + . gene_id "FBgn0052785"; gene_symbol "tRNA:Gln-CTG-2-1";
X FlyBase gene 3827622 3827693 . + . gene_id "FBgn0025118"; gene_symbol "tRNA:Pro-CGG-3-1";
2L FlyBase gene 825969 833241 . + . gene_id "FBgn0010583"; gene_symbol "dock";
2L FlyBase gene 852768 854539 . + . gene_id "FBgn0020545"; gene_symbol "kraken";
2L FlyBase gene 855337 856639 . + . gene_id "FBgn0031288"; gene_symbol "CG13949";
2L FlyBase gene 860197 861806 . + . gene_id "FBgn0031289"; gene_symbol "CG13950";
2L FlyBase gene 877302 878270 . + . gene_id "FBgn0002936"; gene_symbol "ninaA";
#command
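awk '{ if($12 ~ /RNA/ || $12 ~ /mir-/ || $3 == "pseudogene") $1="non-coding"; else $1="coding"; print }' a.gene-pseudogene_all_dmel-all-r6.40.gtf
The code works, but it replaces column 1. That is not what I want: I want to add this new column before column 1, so that it becomes the new column 1.
How do I adjust it?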
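You can (effectively) add a new column by turning `$1` into `something OFS $1`. You are not really creating a new column (`$2` still refers to the original second column, and `$1` now holds the "two" leading columns), but that does not matter in this case: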
awk '{
    x = ( $12 ~ /RNA|mir-/ || $3 == "pseudogene" ) ? "non-" : ""
    $1 = x "coding" OFS $1
    print
}' a.gene-pseudogene_all_dmel-all-r6.40.gtf
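To see why this is not a true new column, here is a minimal sketch on a made-up three-field line (the input `a b c` and the label `new` are illustrative only, not taken from the question). After the assignment, `$0` is rebuilt with the extra text, but awk does not re-split the record, so `NF` is unchanged and `$2` is still the original second field:

```shell
out=$(printf 'a b c\n' | awk '{
    $1 = "new" OFS $1      # prepend text to field 1; $0 is rebuilt using OFS
    print $0               # the printed line now shows four words
    print NF, $2           # awk still counts 3 fields, and $2 is still "b"
}')
printf '%s\n' "$out"
```

If later code in the same script needs the fields renumbered, assigning `$0 = $0` after the change makes awk split the record again.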
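The above technique can be used to insert before (or after) any column. Because we are adding before the first column (the same applies when appending after the last column), the code can be made more efficient by avoiding the assignment: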
awk '{
    x = ( $12 ~ /RNA|mir-/ || $3 == "pseudogene" ) ? "non-" : ""
    print x "coding", $0
}' a.gene-pseudogene_all_dmel-all-r6.40.gtf
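For inserting somewhere other than the start or end of the line, the assignment form is the one that applies. As a rough sketch, assuming you wanted the label before column 3 rather than column 1 (that target column is chosen only for illustration), using two lines copied from the sample file:

```shell
# Two lines copied from the question's sample file.
input=$(cat <<'EOF'
X FlyBase gene 1482492 1482590 . - . gene_id "FBgn0044508"; gene_symbol "snoRNA:M";
2L FlyBase gene 825969 833241 . + . gene_id "FBgn0010583"; gene_symbol "dock";
EOF
)

# Same condition as above, but the label is placed before column 3.
out=$(printf '%s\n' "$input" | awk '{
    x = ( $12 ~ /RNA|mir-/ || $3 == "pseudogene" ) ? "non-" : ""
    $3 = x "coding" OFS $3   # field 3 becomes "<label> <old field 3>"
    print
}')
printf '%s\n' "$out"
```

Note that assigning to a field makes awk rebuild the line with `OFS` (a single space by default), so any original runs of whitespace or tabs between fields are not preserved. The `print x "coding", $0` form leaves the original line untouched.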