Multiple grep output on same line
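This is probably a very trivial question, but I don't have enough experience with grep and echo to answer it myself. I have looked here and here with no success.
I have a file (a .gff file) of over 1,000,000 lines that starts like this: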
NW_007577731.1 RefSeq region 1 3345205 . + . ID=id0;Dbxref=taxon:144197;Name=Unknown;chromosome=Unknown;collection-date=16-Aug-2005;country=USA: Emerald Reef%2C Florida;gbkey=Src;genome=genomic;isolate=25-593;lat-lon=25.6748 N 80.0982 W;mol_type=genomic DNA;sex=male
NW_007577731.1 Gnomon gene 7982 24854 . - . ID=gene0;Dbxref=GeneID:103352799;Name=LOC103352799;gbkey=Gene;gene=LOC103352799;gene_biotype=protein_coding
NW_007577731.1 Gnomon mRNA 7982 24854 . - . ID=rna0;Parent=gene0;Dbxref=GeneID:103352799,Genbank:XM_008279367.1;Name=XM_008279367.1;gbkey=mRNA;gene=LOC103352799;model_evidence=Supporting evidence includes similarity to: 22 Proteins%2C and 73%25 coverage of the annotated genomic feature by RNAseq alignments;product=homer protein homolog 3-like;transcript_id=XM_008279367.1
NW_007577731.1 RefSeq region 1 3345205 . + . ID=id0;Dbxref=taxon:144197;Name=Unknown;chromosome=Unknown;collection-date=16-Aug-2005;country=USA: Emerald Reef%2C Florida;gbkey=Src;genome=genomic;isolate=25-593;lat-lon=25.6748 N 80.0982 W;mol_type=genomic DNA;sex=male
NW_007577731.1 Gnomon gene 7982 24854 . - . ID=gene0;Dbxref=GeneID:103352799;Name=LOC103352799;gbkey=Gene;gene=LOC103352799;gene_biotype=protein_coding
NW_007577731.1 Gnomon mRNA 7982 24854 . - . ID=rna0;Parent=gene0;Dbxref=GeneID:103352799,Genbank:XM_008279367.1;Name=XM_008279367.1;gbkey=mRNA;gene=LOC103352799;model_evidence=Supporting evidence includes similarity to: 22 Proteins%2C and 73%25 coverage of the annotated genomic feature by RNAseq alignments;product=homer protein homolog 3-like;transcript_id=XM_008279367.1
I want to grep the lines that contain mRNA in the third column and get this tab-delimited output (the values of the fields gene=, product= and transcript_id=):
LOC103352799 homer protein homolog 3-like XM_008279367.1
LOC103352799 homer protein homolog 3-like XM_008279367.1
With a great lack of elegance, I can get the 3 columns separately using:
grep "mRNA\t" myfile.gff|sed s/gene=/@/|cut -f2 -d"@" |cut -f1 -d";"
grep "mRNA\t" myfile.gff|sed s/product=/@/|cut -f2 -d"@" |cut -f1 -d";"
grep "mRNA\t" myfile.gff|sed s/transcript_id=/@/|cut -f2 -d"@" |cut -f1 -d";"
But how can I append the outputs of these 3 commands on the same line? I tried:
echo -e "`grep "mRNA\t" myfile.gff|sed s/gene=/@/|cut -f2 -d"@" |cut -f1 -d";"`\t`grep "mRNA\t" myfile.gff|sed s/product=/@/|cut -f2 -d"@" |cut -f1 -d";"`\t`grep "mRNA\t" myfile.gff|sed s/transcript_id=/@/|cut -f2 -d"@" |cut -f1 -d";"`"
But this is the output:
LOC103352799
LOC103352799 homer protein homolog 3-like
homer protein homolog 3-like XM_008279367.1
XM_008279367.1
Thank you very much for your help!
Using awk:
$ awk 'BEGIN {
    FS=OFS="\t"                        # field separators to tab
    k="gene,product,transcript_id"     # keyword list
    split(k,a,",")                     # split keywords to an array for matching
    for(i in a)                        # values to keys
        p[a[i]]
}
$3=="mRNA" {
    b=""                               # reset buffer b
    n=split($9,a,"[=;]")               # split the attributes: key,value,key,value,...
    for(i=1;i<n;i++)                   # iterate in file order and search
        if(a[i] in p)                  # ... for keywords, if match,
            b=b (b==""?"":OFS) a[i+1]  # ... value is the next, buffer
    print b                            # output buffer
}' file
LOC103352799 homer protein homolog 3-like XM_008279367.1
LOC103352799 homer protein homolog 3-like XM_008279367.1
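As for the literal question: the `echo` attempt fails because each command substitution returns a whole column, newlines included, so the three columns are glued end to end instead of row by row. A minimal sketch of joining them row by row with `paste` follows. The sample file, the temporary directory and the `extract` helper are hypothetical names introduced here for illustration, and the rows are selected with awk rather than `grep "mRNA\t"`.

```shell
#!/bin/sh
# Hypothetical three-line sample: tab-separated, attributes in column 9.
tmp=$(mktemp -d)
printf 'chr1\tGnomon\tgene\t1\t9\t.\t-\t.\tID=gene0;gene=LOC1\n' > "$tmp/sample.gff"
printf 'chr1\tGnomon\tmRNA\t1\t9\t.\t-\t.\tID=rna0;gene=LOC1;product=homer protein;transcript_id=XM_1\n' >> "$tmp/sample.gff"
printf 'chr1\tGnomon\tmRNA\t20\t30\t.\t+\t.\tID=rna1;gene=LOC2;product=other protein;transcript_id=XM_2\n' >> "$tmp/sample.gff"

# extract KEY FILE: print the value of KEY= for every mRNA row,
# reusing the sed/cut pipeline from the question.
extract() {
    awk -F'\t' '$3 == "mRNA"' "$2" | sed "s/$1=/@/" | cut -d'@' -f2 | cut -d';' -f1
}

# Write one column per file, then let paste join them line by line (tab by default).
extract gene          "$tmp/sample.gff" > "$tmp/gene.txt"
extract product       "$tmp/sample.gff" > "$tmp/product.txt"
extract transcript_id "$tmp/sample.gff" > "$tmp/id.txt"

result=$(paste "$tmp/gene.txt" "$tmp/product.txt" "$tmp/id.txt")
printf '%s\n' "$result"
rm -rf "$tmp"
```

This still reads the input three times, so for a file of a million lines a single-pass awk or sed solution is likely the better choice.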
Speaking of one-liners, here is one in sed:
sed -nE '/\tmRNA\t/ { s/.*gene=([^;]+).*product=([^;]+).*transcript_id=([^;]+)/\1\t\2\t\3/g;p }' file
The only assumption is a fixed order of the gene, product and transcript_id fields. That could be lifted with some changes, but at the cost of the regex's readability.
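One way to drop the fixed-order assumption, sketched here in awk rather than sed, is to load each row's attributes into a lookup table and print the wanted keys by name. The sample data and the names `attr` and `kv` are hypothetical. A key that is missing from a row yields an empty field.

```shell
#!/bin/sh
# Hypothetical sample: the second row lists its attributes in a different order.
tmp=$(mktemp)
printf 'chr1\tGnomon\tmRNA\t1\t9\t.\t-\t.\tID=rna0;gene=LOC1;product=homer protein;transcript_id=XM_1\n' > "$tmp"
printf 'chr1\tGnomon\tmRNA\t20\t30\t.\t+\t.\tID=rna1;transcript_id=XM_2;product=other protein;gene=LOC2\n' >> "$tmp"

result=$(awk -F'\t' -v OFS='\t' '
$3 == "mRNA" {
    split("", attr)                    # portable way to clear the array
    n = split($9, kv, ";")             # one key=value pair per element
    for (i = 1; i <= n; i++) {
        eq = index(kv[i], "=")         # position of the first "="
        if (eq)
            attr[substr(kv[i], 1, eq - 1)] = substr(kv[i], eq + 1)
    }
    # Columns are printed in this fixed order, whatever the order in the file.
    print attr["gene"], attr["product"], attr["transcript_id"]
}' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Splitting on the first `=` only means that values which themselves contain `=` are kept intact.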