Print matches between two files and the line after each, while ordering by reference file

I have a reference file:

NP_001041718.1
XP_021405980.1
NP_001041719.1
XP_021385112.1
NP_001041721.1
XP_021394530.1
NP_001041722.1
XP_021394327.1
NP_001041723.1
XP_021400667.1

I need to capture each match and the line after it in a target file like the one below, while keeping the order of the reference file:

NP_001041718.1
DVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGSRTKEGVVHGVTTVAEKTKEQVSNVGGAVVTGVTAVAQKTVEGAGNIAAATGLVKKDQLAKQNEEGFLQEGMVNNTGVAVDPENEAYEMPPEEEYQDYEPEA
NP_001041719.1
GKQNSKLRPEVMQDLLESTDFTEHEIQEWYKGFLRDCPSGHLSMEEFKKIYGNFFPYGDASKFAEHVFRTFDANGDGTIDFREFIIALSVTSRGKLEQKLKWAFSMYDLDGNGYISKSEMLEIVQAIYKMVSSVMKMPEDESTPEKRTEKIFRQMDTNRDGKLSLEEFIRGAKSDPSIVRLLQCDPSSAGQF
NP_001041721.1
TMESGAENQQSGDAAGTEAETQQMTVQAQPQIATLAQVSMPAAHATSSAPTVTLVQLPNGQTVQVHGVIQAAQPSVIQSPQVQTVQISTIAESEDSQESVDSVTDSQKRREILSRRPSYRKILNDLSSDAPGVPRIEEEKSEEETAAPAIATVTVPTPIYQTSSGQYIAITQGGAIQLSNNGTDGVQGLQTLTMTNAAATQPGTTILQYAQTTDGQQILVPSNQVVVQAASGDVQTYQIRTAPTSTIAPGVVMASSPALPTQPAEEAARKREVRLMKNREAARECRRKKKEYVKCLENRVAVLENQNKTLIEELKALKDLYCHKSD
NP_001041722.1
RVNESELNSSVLPRDPPAEGAPRRQPWVTSTLAAILIFTIAVDLLGNLLVILSVYRNKKLRNAGNVFVVSLAVADLIVAIYPYPLVLTSVFHNGWKLGYLHCQISGFLMGLSVIGSIFNITGIAINRYCYICHSLKYDKLYSDRNSLCYIVLIWLLTFVAIVPNLFVGSLQYDPRIYSCTFAQSVSSAYTIAVVFFHFLLPIAVVTFCYLRIWILVIQVRRRVKPDNNPRLKPHDFRNFVTMFVVFVLFAVCWAPLNFIGIAVAVNPKTVIPRIPEWLFVSSYYMAYFNSCLNAIVYGLLNQNFRREYKRIIVNFCTAKVFFQDSSNDAGDRMRSKPSPLITNNNQVKVDSV
NP_001041723.1
LENGSLRNCCDPGGRGRLGLAEREAAAAGAPRPAWVVPVLSSVLIFTTVVDILGNLLVILSVFKNRKLRNSGNAFVVSLALADLVVALYPYPLVLLAIFHNGWTLGETHCKASGFVMGLSVIGSIFNITAIAINRYCYICHSFAYDKVYSCWNTMLYVSLVWILTVIATVPNFFVGSLKYDPRIYSCTFVQTASSYYTIAVVVIHFIVPITIVSFCYLRIWVLVLQVRRRVKSETKPRLKPSDFRNFLTMFVVFVIFAFCWAPLNFIGLAVAIDPTEMAPKVPEWLFIISYLMAYFNSCLNAIIYGLLNQNFRNEYKRISMSLWMPRLFFQDTSKGGTDGQKSKPSPALNNNNQMKTETL
XP_021405980.1
DVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGSRTKEGVVHGVTTVAEKTKEQVSNVGGAVVTGVTAVAQKTVEGAGNIAAATGLVKKDQLAKQNEEGFLQEGMVNNTGVAVGPENEAYKMPPEEEYQDYEPEA
XP_021385112.1
GKQNSKLRPEVMQDLLESTDFTEHEIQEWYKGFLRDCPSGHLSMEEFKKIYGNFFPYGDASKFAEHVFRTFDANGDGTIDFREFIIALSVTSRGKLEQKLKWAFSMYDLDGNGYISKSEMLEIVQAIYKMVSSVMKMPEDESTPEKRTEKIFRQMDTNRDGKLSLEEFIRGAKSDPSIVRLLQCDPSSAGQF
XP_021394530.1
TMESGAENQQSGDAAGTEAETQQMTVQAQPQIATLAQVSMPAAHATSSAPTVTLVQLPNGQTVQVHGVIQAAQPSVIQSPQVQTVQISTIAESEDSQESVDSVTDSQKRREILSRRPSYRKILNDLSSDAPGVPRIEEEKSEEETAAPAIATVTVPTPIYQTSSGQYIAITQGGAIQLSNNGTDGVQGLQTLTMTNAAATQPGTTILQYAQTTDGQQILVPSNQVVVQAASGDVQTYQIRTAPTSTIAPGVVMASSPALPTQPAEEAARKREVRLMKNREAARECRRKKKEYVKCLENRVAVLENQNKTLIEELKALKDLYCHKSD
XP_021394327.1
RVNESELNSSVLPRDPPAEGAPRRQPWVTSTLAAILIFTIAVDLLGNLLVILSVYRNKKLRNAGNVFVVSLAVADLIVAIYPYPLVLTSVFHNGWKLGYLHCQISGFLMGLSVIGSIFNITGIAINRYCYICHSLKYDKLYSDRNSLCYIVLIWLLTFVAIVPNLFVGSLQYDPRIYSCTFAQSVSSAYTIAVVFFHFLLPIAVVTFCYLRIWILVIQVRRRVKPDNNPRLKPHDFRNFVTMFVVFVLFAVCWAPLNFIGIAVAVNPKTVIPRIPEWLFVSSYYMAYFNSCLNAIVYGLLNQNFRREYKRIIVNFCTAKVFFQDSSNDAGDRMRSKPSPLITNNNQVKVDSV
XP_021400667.1
LENGSLRNCCDPGGRGRLGLAEREAAAAGAPRPAWVVPVLSSVLIFTTVVDILGNLLVILSVFKNRKLRNSGNAFVVSLALADLVVALYPYPLVLLAIFHNGWTLGETHCKASGFVMGLSVIGSIFNITAIAINRYCYICHSFAYDKVYSCWNTMLYVSLVWILTVIATVPNFFVGSLKYDPRIYSCTFVQTASSYYTIAVVVIHFIVPITIVSFCYLRIWVLVLQVRRRVKSETKPRLKPSDFRNFLTMFVVFVIFAFCWAPLNFIGLAVAIDPTEMAPKVPEWLFIISYLMAYFNSCLNAIIYGLLNQNFRNEYKRILMSLWMPRLFFQDTSKGGTDGQKSKPSPALNNNNQMKTETI

So the output would look like:

    NP_001041718.1
    DVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGSRTKEGVVHGVTTVAEKTKEQVSNVGGAVVTGVTAVAQKTVEGAGNIAAATGLVKKDQLAKQNEEGFLQEGMVNNTGVAVDPENEAYEMPPEEEYQDYEPEA
    XP_021405980.1
    DVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGSRTKEGVVHGVTTVAEKTKEQVSNVGGAVVTGVTAVAQKTVEGAGNIAAATGLVKKDQLAKQNEEGFLQEGMVNNTGVAVGPENEAYKMPPEEEYQDYEPEA
    NP_001041719.1
    GKQNSKLRPEVMQDLLESTDFTEHEIQEWYKGFLRDCPSGHLSMEEFKKIYGNFFPYGDASKFAEHVFRTFDANGDGTIDFREFIIALSVTSRGKLEQKLKWAFSMYDLDGNGYISKSEMLEIVQAIYKMVSSVMKMPEDESTPEKRTEKIFRQMDTNRDGKLSLEEFIRGAKSDPSIVRLLQCDPSSAGQF
    XP_021385112.1
    ....

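I know how to find the matches in the target while keeping the order of the ref file, using awk 'FNR==NR {a[$1]=$0; next}; $1 in a {getline} {print a[$1]}' target ref, but I don't know how to print the line that follows. I also know how to print the following line with grep -A 1 -f ref target, but that returns the results in the order of the target file.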

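You can reverse the order of the files passed to awk: pass the ref file first and build an array that maps each line to an incrementing number, so that the order of the keys is recorded.

Instead of using getline, you can save the previous line in a variable and check whether that previous line exists as a key in the array built from the first file.

If it does, store the previous line plus the current line in a new array, final, under the recorded position, and walk through that array in position order in the END block.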

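awk '{
  if (FNR==NR) {
    a[$0]=i++; next              # ref file: map each ID to its position (0, 1, 2, ...)
  }
  if (last in a) {
    final[a[last]] = last RS $0  # previous line was a wanted ID: keep it with the current line
  }
  last = $0                      # remember the current line for the next iteration
}
# "for (k in final)" visits keys in an unspecified order, so loop by position instead
END { for (j = 0; j < i; j++) if (j in final) print final[j] }
' ref target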

Output

NP_001041718.1
DVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGSRTKEGVVHGVTTVAEKTKEQVSNVGGAVVTGVTAVAQKTVEGAGNIAAATGLVKKDQLAKQNEEGFLQEGMVNNTGVAVDPENEAYEMPPEEEYQDYEPEA
XP_021405980.1
DVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGSRTKEGVVHGVTTVAEKTKEQVSNVGGAVVTGVTAVAQKTVEGAGNIAAATGLVKKDQLAKQNEEGFLQEGMVNNTGVAVGPENEAYKMPPEEEYQDYEPEA
NP_001041719.1
GKQNSKLRPEVMQDLLESTDFTEHEIQEWYKGFLRDCPSGHLSMEEFKKIYGNFFPYGDASKFAEHVFRTFDANGDGTIDFREFIIALSVTSRGKLEQKLKWAFSMYDLDGNGYISKSEMLEIVQAIYKMVSSVMKMPEDESTPEKRTEKIFRQMDTNRDGKLSLEEFIRGAKSDPSIVRLLQCDPSSAGQF
XP_021385112.1
GKQNSKLRPEVMQDLLESTDFTEHEIQEWYKGFLRDCPSGHLSMEEFKKIYGNFFPYGDASKFAEHVFRTFDANGDGTIDFREFIIALSVTSRGKLEQKLKWAFSMYDLDGNGYISKSEMLEIVQAIYKMVSSVMKMPEDESTPEKRTEKIFRQMDTNRDGKLSLEEFIRGAKSDPSIVRLLQCDPSSAGQF
NP_001041721.1
TMESGAENQQSGDAAGTEAETQQMTVQAQPQIATLAQVSMPAAHATSSAPTVTLVQLPNGQTVQVHGVIQAAQPSVIQSPQVQTVQISTIAESEDSQESVDSVTDSQKRREILSRRPSYRKILNDLSSDAPGVPRIEEEKSEEETAAPAIATVTVPTPIYQTSSGQYIAITQGGAIQLSNNGTDGVQGLQTLTMTNAAATQPGTTILQYAQTTDGQQILVPSNQVVVQAASGDVQTYQIRTAPTSTIAPGVVMASSPALPTQPAEEAARKREVRLMKNREAARECRRKKKEYVKCLENRVAVLENQNKTLIEELKALKDLYCHKSD
XP_021394530.1
TMESGAENQQSGDAAGTEAETQQMTVQAQPQIATLAQVSMPAAHATSSAPTVTLVQLPNGQTVQVHGVIQAAQPSVIQSPQVQTVQISTIAESEDSQESVDSVTDSQKRREILSRRPSYRKILNDLSSDAPGVPRIEEEKSEEETAAPAIATVTVPTPIYQTSSGQYIAITQGGAIQLSNNGTDGVQGLQTLTMTNAAATQPGTTILQYAQTTDGQQILVPSNQVVVQAASGDVQTYQIRTAPTSTIAPGVVMASSPALPTQPAEEAARKREVRLMKNREAARECRRKKKEYVKCLENRVAVLENQNKTLIEELKALKDLYCHKSD
NP_001041722.1
RVNESELNSSVLPRDPPAEGAPRRQPWVTSTLAAILIFTIAVDLLGNLLVILSVYRNKKLRNAGNVFVVSLAVADLIVAIYPYPLVLTSVFHNGWKLGYLHCQISGFLMGLSVIGSIFNITGIAINRYCYICHSLKYDKLYSDRNSLCYIVLIWLLTFVAIVPNLFVGSLQYDPRIYSCTFAQSVSSAYTIAVVFFHFLLPIAVVTFCYLRIWILVIQVRRRVKPDNNPRLKPHDFRNFVTMFVVFVLFAVCWAPLNFIGIAVAVNPKTVIPRIPEWLFVSSYYMAYFNSCLNAIVYGLLNQNFRREYKRIIVNFCTAKVFFQDSSNDAGDRMRSKPSPLITNNNQVKVDSV
XP_021394327.1
RVNESELNSSVLPRDPPAEGAPRRQPWVTSTLAAILIFTIAVDLLGNLLVILSVYRNKKLRNAGNVFVVSLAVADLIVAIYPYPLVLTSVFHNGWKLGYLHCQISGFLMGLSVIGSIFNITGIAINRYCYICHSLKYDKLYSDRNSLCYIVLIWLLTFVAIVPNLFVGSLQYDPRIYSCTFAQSVSSAYTIAVVFFHFLLPIAVVTFCYLRIWILVIQVRRRVKPDNNPRLKPHDFRNFVTMFVVFVLFAVCWAPLNFIGIAVAVNPKTVIPRIPEWLFVSSYYMAYFNSCLNAIVYGLLNQNFRREYKRIIVNFCTAKVFFQDSSNDAGDRMRSKPSPLITNNNQVKVDSV
NP_001041723.1
LENGSLRNCCDPGGRGRLGLAEREAAAAGAPRPAWVVPVLSSVLIFTTVVDILGNLLVILSVFKNRKLRNSGNAFVVSLALADLVVALYPYPLVLLAIFHNGWTLGETHCKASGFVMGLSVIGSIFNITAIAINRYCYICHSFAYDKVYSCWNTMLYVSLVWILTVIATVPNFFVGSLKYDPRIYSCTFVQTASSYYTIAVVVIHFIVPITIVSFCYLRIWVLVLQVRRRVKSETKPRLKPSDFRNFLTMFVVFVIFAFCWAPLNFIGLAVAIDPTEMAPKVPEWLFIISYLMAYFNSCLNAIIYGLLNQNFRNEYKRISMSLWMPRLFFQDTSKGGTDGQKSKPSPALNNNNQMKTETL
XP_021400667.1
LENGSLRNCCDPGGRGRLGLAEREAAAAGAPRPAWVVPVLSSVLIFTTVVDILGNLLVILSVFKNRKLRNSGNAFVVSLALADLVVALYPYPLVLLAIFHNGWTLGETHCKASGFVMGLSVIGSIFNITAIAINRYCYICHSFAYDKVYSCWNTMLYVSLVWILTVIATVPNFFVGSLKYDPRIYSCTFVQTASSYYTIAVVVIHFIVPITIVSFCYLRIWVLVLQVRRRVKSETKPRLKPSDFRNFLTMFVVFVIFAFCWAPLNFIGLAVAIDPTEMAPKVPEWLFIISYLMAYFNSCLNAIIYGLLNQNFRNEYKRILMSLWMPRLFFQDTSKGGTDGQKSKPSPALNNNNQMKTETI
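
To try the approach above on a small scale, here is a minimal sketch that builds two throwaway files and runs the same idea against them. The IDs (id1, id2, id3), the sequences (AAA, BBB, CCC), and the variable names are made up for illustration, and it assumes mktemp -d is available on your system.

```shell
#!/bin/sh
# Hypothetical miniature inputs: ref lists the IDs in the wanted order,
# target holds ID/sequence pairs in a different order.
tmpdir=$(mktemp -d)
printf '%s\n' id1 id3 id2 > "$tmpdir/ref"
printf '%s\n' id1 AAA id2 BBB id3 CCC > "$tmpdir/target"

out=$(awk '
FNR == NR { order[$0] = ++n; next }                 # ref: remember the position of each ID
prev in order { final[order[prev]] = prev RS $0 }   # target: previous line was a wanted ID
{ prev = $0 }                                       # keep the current line for the next record
END { for (j = 1; j <= n; j++) if (j in final) print final[j] }
' "$tmpdir/ref" "$tmpdir/target")

printf '%s\n' "$out"
rm -rf "$tmpdir"
```

The pairs should come out in the order of ref (id1, id3, id2) rather than in the order of target, because the END block walks the positions numerically instead of relying on the array's iteration order.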

A variant using getline into a variable:

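awk '{
  if (FNR==NR) {
    i++;a[i]=$0;b[$0]=i;next     # ref file: a maps position to ID, b maps ID to position
  }
  if ($0 in b && (getline tmp) > 0) {
    final[b[$0]] = a[b[$0]] RS tmp   # tmp holds the line after the match; $0 is unchanged
  }
}
# loop by position, since "for (k in final)" does not guarantee any order
END { for (j = 1; j <= i; j++) if (j in final) print final[j] }
' ref target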

Would you please try the following:

awk 'FNR==NR {                          # process "target" file
    if (FNR%2) a[key=$1]=$0             # store odd lines in array a
    else b[key]=$0                      # store even lines in array b using the same key as the previous line
    next
}
$1 in a {print a[$1]; print b[$1]}      # if the key matches, print the odd line and the even line
' target ref

With your shown samples, please try the following awk code. Written and tested in GNU awk.

awk '
FNR==NR{
  if(FNR%2==0){
    arr[prev]=$0
  }
  else{
    prev=$0
  }
  next
}
($0 in arr){
  print $0 ORS arr[$0]
}
' target ref

Explanation: Adding a detailed explanation for the above.

awk '                     ##Starting awk program from here.
FNR==NR{                  ##Checking condition FNR==NR, which is TRUE while the target file (the first file) is being read.
  if(FNR%2==0){           ##Checking if the current line number is evenly divisible by 2 (an even line), then do following.
    arr[prev]=$0          ##Creating arr with index of prev and value of the current line.
  }
  else{                   ##Else (an odd line) do following.
    prev=$0               ##Setting prev to the current line.
  }
  next                    ##next will skip all further statements from here.
}
($0 in arr){              ##If the current line of the ref file is present as an index in arr then do following.
  print $0 ORS arr[$0]    ##Printing the current line, ORS, and the arr value stored under index $0.
}
' target ref              ##Mentioning Input_file names here.
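
As a quick check of how this target-first approach behaves, the following sketch runs the same logic on two tiny made-up files. The IDs (id1, id2, idX), the sequences, and the array name seq are hypothetical; it assumes the target strictly alternates ID line and sequence line, and that mktemp -d is available.

```shell
#!/bin/sh
# Hypothetical miniature inputs: idX appears in ref but not in target.
tmpdir=$(mktemp -d)
printf '%s\n' id2 idX id1 > "$tmpdir/ref"
printf '%s\n' id1 AAA id2 BBB > "$tmpdir/target"

out=$(awk '
FNR == NR {                          # first file (target): pair each ID with its sequence
  if (FNR % 2 == 0) seq[prev] = $0   # even line: sequence belonging to the previous ID
  else prev = $0                     # odd line: remember the ID
  next
}
($0 in seq) { print $0 ORS seq[$0] } # second file (ref): print known IDs in ref order
' "$tmpdir/target" "$tmpdir/ref")

printf '%s\n' "$out"
rm -rf "$tmpdir"
```

Because the ref file is streamed second, the output follows its order without any sorting step, and an ID that is missing from the target (idX here) is simply skipped. The trade-off is that every ID/sequence pair of the target is held in memory, which may matter for very large target files.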