Text processing question: remove rows where one column does not contain a particular value

Question

I have a tab-delimited file that looks like this:

input_sequence  match_sequence  score   receptor_group  epitope antigen organism    
ASRPPGGVNEQF    ASRPPGGVNEQF    1.00    25735   EPLPQGQLTAY surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]  SARS-CoV2
ASSYSGGYEQY ASSYSGGYEQY 1.00    33843   KTAYSHLSTSK polymerase  Hepatitis B virus (hepatitis B virus (HBV))
ASSYSGGYEQY ASSYSGGYEQY 1.00    131430  KLSYGIATV   orf1ab polyprotein [Severe acute respiratory syndrome coronavirus 2]    SARS-CoV2
ASSYSGGYEQY ASSFSGGYEQY 0.97    82603   FTISVTTEIL  surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]  SARS-CoV2
ASSYSGGYEQY ASSYAGGYEQY 0.98    133155  FVCNLLLLFVTVYSHLLLV ORF3a protein [Severe acute respiratory syndrome coronavirus 2] SARS-CoV2
ASSLFGSTDTQY    ASSLFGSTDTQY    1.00    92508   FTISVTTEIL  surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]  SARS-CoV2

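I want to keep only the 'input_sequence' values that match 'organism' = SARS-CoV2 and nothing else. So in this example I would keep only lines 2 and 7 and discard lines 3, 4, 5, and 6, because there the 'input_sequence' also matches Hepatitis B virus.

My file has more than 20,000 lines in total.

Desired result: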

input_sequence  match_sequence  score   receptor_group  epitope antigen organism    
ASRPPGGVNEQF    ASRPPGGVNEQF    1.00    25735   EPLPQGQLTAY surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]  SARS-CoV2
ASSLFGSTDTQY    ASSLFGSTDTQY    1.00    92508   FTISVTTEIL  surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]  SARS-CoV2

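Is there a quick way to do this with awk or bash (without writing a long script)? Any hints are welcome.

I was thinking of using awk to count the occurrences of each value in column 1 and the occurrences of SARS-CoV2 in column 7, and then keep only the values where the two counts match, but I don't know how to do that. I only got this far (counting the occurrences in the first column):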

awk '{for(i=1;i<=NF;i++)if($i ~ /^/)x++;print x;x=0}' file

Thanks!

Answer 1

awk '
    NR==1                              # Print first line (header)
    $NF != "SARS-CoV2" { bad[$1] }     # Collect primary keys of "bad" records based on content in last field
    $NF == "SARS-CoV2" { good[$1]=$0 } # Collect primary keys of "good" records with opposite check
    END {
        for(v in bad) delete good[v]   # Remove primary keys from "good" records that also appear in "bad" records
        for(v in good) print good[v]   # Print the "good" rows
    }
' file

This makes a single pass over the file, so it may be a solution. It will remove any duplicate entries.
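If the duplicate rows and the original row order matter, a single-pass variant can buffer the rows and filter them in the END block. The following is only a sketch under assumptions: the sample file, its reduced three-column layout, and the names `sample`, `result`, `key`, and `row` are all made up for illustration, and it assumes the organism is the last tab-separated field with no trailing tab.

```shell
# Hypothetical three-column sample (input_sequence, epitope, organism).
sample=$(mktemp)
printf '%s\t%s\t%s\n' \
    input_sequence epitope organism \
    AAA E1 SARS-CoV2 \
    BBB E2 'Hepatitis B virus' \
    BBB E3 SARS-CoV2 \
    CCC E4 SARS-CoV2 \
    CCC E5 SARS-CoV2 > "$sample"

result=$(awk -F'\t' '
    NR == 1 { print; next }              # print the header and skip it
    { n++; key[n] = $1; row[n] = $0 }    # buffer every data row in input order
    $NF != "SARS-CoV2" { bad[$1] }       # flag keys seen with another organism
    END {
        for (i = 1; i <= n; i++)         # replay rows in their original order
            if (!(key[i] in bad)) print row[i]
    }
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

The trade-off is memory: every data row is held until the end of input, which should be unproblematic for a file of roughly 20,000 lines.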

Answer 2

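You need to make two passes over the input.

The first pass builds an array whose keys are the input sequences that have an organism other than SARS-CoV2. The second pass checks whether the current input sequence is in that array. If it is not, it prints the line.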

# FNR > 1 keeps the header from being flagged, so it is still printed in the second pass
awk -F'\t' 'NR==FNR {if (FNR > 1 && $7 != "SARS-CoV2") {a[$1]=1}; next}
            !a[$1]' file file
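The same two-pass structure can also express the counting idea from the question: tally the rows per input sequence and the SARS-CoV2 rows per input sequence, then keep a row only when the two tallies agree. This is a rough sketch, not part of the answer above; the sample data and the names `sample`, `counted`, `total`, and `sars` are hypothetical, and it assumes the organism is the last tab-separated field.

```shell
# Hypothetical three-column sample (input_sequence, epitope, organism).
sample=$(mktemp)
printf '%s\t%s\t%s\n' \
    input_sequence epitope organism \
    AAA E1 SARS-CoV2 \
    BBB E2 'Hepatitis B virus' \
    BBB E3 SARS-CoV2 \
    CCC E4 SARS-CoV2 \
    CCC E5 SARS-CoV2 > "$sample"

counted=$(awk -F'\t' '
    NR == FNR {                              # first pass: tally per input_sequence
        if (FNR > 1) {                       # ignore the header
            total[$1]++
            if ($NF == "SARS-CoV2") sars[$1]++
        }
        next
    }
    # second pass: print the header, or rows whose key only ever had SARS-CoV2
    FNR == 1 || total[$1] == sars[$1] + 0
' "$sample" "$sample")

printf '%s\n' "$counted"
rm -f "$sample"
```

Compared with flagging "bad" keys, the counts carry a little more information (for instance, how many matches each sequence has), at the cost of two arrays instead of one.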

Answer 3

You may consider this awk, which joins the file with itself on column 1:

awk -F'\t' 'NR==FNR {$NF != "SARS-CoV2" && bad[$1]; next}
FNR == 1 || !($1 in bad)' file{,} | column -s $'\t' -t

input_sequence  match_sequence  score  receptor_group  epitope      antigen                                                                          organism
ASRPPGGVNEQF    ASRPPGGVNEQF    1.00   25735           EPLPQGQLTAY  Trans-activator protein BZLF1 [Severe acute respiratory syndrome coronavirus 2]  SARS-CoV2
ASSLFGSTDTQY    ASSLFGSTDTQY    1.00   92508           FTISVTTEIL   surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]           SARS-CoV2

PS: column -s $'\t' -t is only there for tabular display; you can remove it. file{,} is bash brace expansion for file file.

If you want to remove dupes based on the first column, use:

awk -F'\t' 'NR==FNR {$NF != "SARS-CoV2" && bad[$1]; next}
FNR == 1 || (!($1 in bad) && !seen[$1]++)' file{,}


bash

awk