Bash: text processing command

Question

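I can already get the result I want by running one command per line, but I know there must be a more elegant way to do it. Please tell me how you would approach this. I want to learn more sophisticated ways of processing text files.

The original file is a VCF file that looks like this: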

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20180307
##source=PLINKv1.90
##contig=<ID=1,length=249214117>
##contig=<ID=2,length=242842533>
##contig=<ID=3,length=197896741>
...
...
...
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT
22  16258171    22:16258171:D:3 A   .   .   .   .   GT
22  16258174    22:16258174:T:C T   .   .   .   .   GT
22  16258183    22:16258183:A:T A   .   .   .   .   GT
22  16258189    22:16258189:G:T G   .   .   .   .   GT

My goal is to generate a file that looks like this:

22  16258171  16258171  D  3
22  16258174  16258174  T  C
22  16258183  16258183  A  T
22  16258189  16258189  G  T
22  16258211  16258211  A  G
22  16258211  16258211  A  T
22  16258220  16258220  T  G
22  16258221  16258221  C  T
22  16258224  16258224  C  T
22  16258227  16258227  G  A

I used the following steps to reach that goal, but the result is cumbersome and ugly:

#remove comments
sed '/^[[:blank:]]*#/d;s/#.*//' chr22.vcf > no_comment_chr22.vcf

#take out the third columns for splitting
cut -d $'\t' -f 3 no_comment_chr22.vcf > no_comment_chr22.col3_to_split.txt

#Split string by delimiter and get N-th element, use as col4
cut -d':' -f3 no_comment_chr22.col3_to_split.txt > chr22_as_col4.txt

#Split string by delimiter and get N-th element, use as col5
cut -d':' -f4 no_comment_chr22.col3_to_split.txt > chr22_as_col5.txt

#get first 2 columns
cut -d $'\t' -f 1-2 no_comment_chr22.vcf > no_comment_chr22.col1to2.txt

#get the second column as col3 
cut -d $'\t' -f 2 no_comment_chr22.vcf > no_comment_chr22.ascol3.txt

#Combine files column-wise
paste no_comment_chr22.col1to2.txt no_comment_chr22.ascol3.txt chr22_as_col4.txt chr22_as_col5.txt | column -s $'\t' -t  > chr22_input_5cols.txt

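I get what I need, but it is really ugly. Please tell me what people do to improve their text-processing skills, and how something like this could be improved. Thanks!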

Answer 1

You can try this sed script:

sed -E '
/^#/d
s/(([0-9]*[[:blank:]]*){2})[^:]*((:[^:[[:blank:]]*){3}).*/\1\3/
s/:/ /g
s/[[:blank:]]{1,}/  /g
' infile
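To see what each command in that script contributes, here is a minimal annotated sketch. The function name `vcf_sed` is hypothetical, the replacement `\1\3` is an assumption about what the answer intended, and the sketch assumes a sed that accepts `-E` (GNU or BSD sed).

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the sed script, one comment per command.
vcf_sed() {
    sed -E '
        # Drop meta and header lines, which all start with "#".
        /^#/d
        # Group 1: the first two columns plus their trailing blanks.
        # [^:]* : the chromosome prefix of the ID column (discarded).
        # Group 3: the next three ":"-prefixed pieces (pos, ref, alt).
        # .*    : everything after the ID column (discarded).
        s/(([0-9]*[[:blank:]]*){2})[^:]*((:[^:[[:blank:]]*){3}).*/\1\3/
        # Turn the remaining colons into spaces.
        s/:/ /g
        # Squeeze every run of blanks into two spaces.
        s/[[:blank:]]{1,}/  /g
    ' "$@"
}

# Demo on one record from the question; printf supplies real tabs.
printf '##fileformat=VCFv4.2\n22\t16258171\t22:16258171:D:3\tA\t.\t.\t.\t.\tGT\n' | vcf_sed
```

On that record the demo should print `22  16258171  16258171  D  3`. Because `[[:blank:]]` matches both spaces and tabs, the same script should work whether the columns are separated by tabs or by spaces.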

Answer 2

Using awk:

awk -F'(:| +)' '/^#/ {next} {print $1,$2,$4,$5,$6}' sample.vcf

Output:

22 16258171 16258171 D 3
22 16258174 16258174 T C
22 16258183 16258183 A T
22 16258189 16258189 G T

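This specifies a regular expression as the field separator (-F), then skips comment lines (^#) and, for every other line, prints the relevant fields ($1, $2, $4, $5, $6).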
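The separator above matches colons and runs of spaces, which suits the sample as pasted. The question's own `cut -d $'\t'` commands suggest the real file is tab-separated, so here is a hedged variant for that case. It is only a sketch: the function name `vcf_to_5cols` is hypothetical, and it assumes the ID column always has the form chrom:pos:ref:alt.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "chrom pos pos ref alt" for each VCF record.
# Assumes tab-separated columns and an ID column like 22:16258171:D:3.
vcf_to_5cols() {
    awk -F'\t' -v OFS='  ' '
        /^#/ { next }                  # skip meta and header lines
        {
            n = split($3, id, ":")     # id[3] = ref allele, id[4] = alt allele
            if (n >= 4)                # ignore IDs that do not fit the pattern
                print $1, $2, $2, id[3], id[4]
        }
    ' "$@"
}

# Demo on two records from the question; printf supplies real tabs.
printf '#CHROM\tPOS\tID\tREF\n22\t16258171\t22:16258171:D:3\tA\n22\t16258174\t22:16258174:T:C\tT\n' | vcf_to_5cols
```

The demo should print `22  16258171  16258171  D  3` and `22  16258174  16258174  T  C`. Splitting only the third column keeps the tab-delimited fields intact, and setting `OFS` to two spaces reproduces the spacing of the target file without a separate `column -t` pass.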
