awk: match three columns across two files and append the matching lines to a new file

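There are many posts similar to this one. I've spent hours trying to solve this and I'm getting desperate, because it looks like it should be simple.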

I have a file like this:

tig00000005 15310   16162   XP_012153921.1  NW_003797090.1  LOC105664333    PREDICTED: elastin-like
tig00000005 23339   23974   XP_012152584.1  NW_003797083.1  LOC100878991    PREDICTED: LOW QUALITY PROTEIN
tig00000005 24600   25138   XP_012143166.1  NW_003797196.1  LOC100881279    PREDICTED: ankyrin-2 isoform X2
tig00000005 2685    4511    XP_012144644.1  NW_003797249.1  LOC105662970    PREDICTED: fibrinogen alpha chain-like isoform X2
tig00000005 28923   29432   XP_012148395.1  NW_003797444.1  LOC100881617    PREDICTED: eukaryotic translation initiation factor 4 gamma 3-like isoform X12
tig00000005 32415   34324   XP_012153921.1  NW_003797090.1  LOC105664333    PREDICTED: elastin-like

The second file looks like this:

tig00000005 maker   gene    15310   16162   .   +   .   ID=snap_masked-tig00000005-processed-gene-0.2;Name=snap_masked-tig00000005-processed-gene-0.2
tig00000005 maker   gene    16764   17237   .   +   .   ID=snap_masked-tig00000005-processed-gene-0.3;Name=snap_masked-tig00000005-processed-gene-0.3
tig00000005 maker   gene    23339   23974   .   +   .   ID=snap_masked-tig00000005-processed-gene-0.4;Name=snap_masked-tig00000005-processed-gene-0.4
tig00000005 maker   gene    24600   25138   .   -   .   ID=snap_masked-tig00000005-processed-gene-0.10;Name=snap_masked-tig00000005-processed-gene-0.10
tig00000005 maker   gene    25472   26900   .   +   .   ID=snap_masked-tig00000005-processed-gene-0.5;Name=snap_masked-tig00000005-processed-gene-0.5

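I want to match columns 1, 2, and 3 of the first file against columns 1, 4, and 5 of the second file and, where they match, append the second file's line to the first file's line, like so: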

tig00000005 15310   16162   XP_012153921.1  NW_003797090.1  LOC105664333    PREDICTED: elastin-like tig00000005 maker   gene    15310   16162   .   +   .   ID=snap_masked-tig00000005-processed-gene-0.2;Name=snap_masked-tig00000005-processed-gene-0.2

Some sample code that doesn't work:

awk 'OFS="\t"; FS="\t"; NR==FNR{a[$1,$2,$3]=$0; next} (($1,$4,$5) in a){print $0,a[$0]}'  file 1 file 2

awk 'OFS="\t"; FS="\t"; NR==FNR{a[$1,$2,$3]=($1,$4,$5)} {print $0,a[$0]}' file 1 file 2

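The first outputs a file in which each line of file 1 is followed by (not appended to) file 2; the second throws an error related to the = sign. I've tried every permutation I can think of. Thanks for any help you can offer.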

Like this?

awk 'NR==FNR{a[$1" "$2" "$3]=$0; next}; {if($1" "$4" "$5 in a){print a[$1" "$4" "$5],$0}}' file1 file2
tig00000005 15310   16162   XP_012153921.1  NW_003797090.1  LOC105664333    PREDICTED: elastin-like tig00000005 maker   gene    15310   16162   .   +   .   ID=snap_masked-tig00000005-processed-gene-0.2;Name=snap_masked-tig00000005-processed-gene-0.2
tig00000005 23339   23974   XP_012152584.1  NW_003797083.1  LOC100878991    PREDICTED: LOW QUALITY PROTEIN tig00000005 maker   gene    23339   23974   .   +   .   ID=snap_masked-tig00000005-processed-gene-0.4;Name=snap_masked-tig00000005-processed-gene-0.4
tig00000005 24600   25138   XP_012143166.1  NW_003797196.1  LOC100881279    PREDICTED: ankyrin-2 isoform X2 tig00000005 maker   gene    24600   25138   .   -   .   ID=snap_masked-tig00000005-processed-gene-0.10;Name=snap_masked-tig00000005-processed-gene-0.10

To write to a new file, just run:

awk 'NR==FNR{a[$1" "$2" "$3]=$0; next}; {if($1" "$4" "$5 in a){print a[$1" "$4" "$5],$0}}' file1 file2 > file3
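
For anyone who wants to try this end to end, here is a minimal sketch, assuming tab-separated input. The scratch directory and the cut-down two-row inputs are made up for the demo (the rows themselves are copied from the question); only the awk one-liner comes from the answer above.

```shell
#!/bin/sh
# Scratch directory for the demo (hypothetical; any writable location works).
workdir=$(mktemp -d)

# Two rows from the question's first file: one has a partner in file2, one does not.
printf 'tig00000005\t15310\t16162\tXP_012153921.1\tNW_003797090.1\tLOC105664333\tPREDICTED: elastin-like\n' > "$workdir/file1"
printf 'tig00000005\t2685\t4511\tXP_012144644.1\tNW_003797249.1\tLOC105662970\tPREDICTED: fibrinogen alpha chain-like isoform X2\n' >> "$workdir/file1"

# Two rows from the second file: gene-0.2 matches on (col 1, col 4, col 5), gene-0.3 does not.
printf 'tig00000005\tmaker\tgene\t15310\t16162\t.\t+\t.\tID=snap_masked-tig00000005-processed-gene-0.2;Name=snap_masked-tig00000005-processed-gene-0.2\n' > "$workdir/file2"
printf 'tig00000005\tmaker\tgene\t16764\t17237\t.\t+\t.\tID=snap_masked-tig00000005-processed-gene-0.3;Name=snap_masked-tig00000005-processed-gene-0.3\n' >> "$workdir/file2"

# Pass 1 (NR==FNR): remember each file1 line under the key "col1 col2 col3".
# Pass 2: build "col1 col4 col5" from file2 and print both lines when the key exists.
awk 'NR==FNR { a[$1" "$2" "$3] = $0; next }
     ($1" "$4" "$5) in a { print a[$1" "$4" "$5], $0 }' \
    "$workdir/file1" "$workdir/file2" > "$workdir/file3"

cat "$workdir/file3"
```

Only the elastin-like row ends up in file3, followed by its gene-0.2 line. The two halves are joined by a space because OFS is left at its default; set OFS in a BEGIN block if a tab is wanted there.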

A couple of small changes to the OP's first awk script:

# old:

awk 'OFS="\t"; FS="\t"; NR==FNR{a[$1,$2,$3]=$0; next} (($1,$4,$5) in a){print $0,a[$0]}' file1 file2

# new - add BEGIN block, modify print statement:

awk 'BEGIN {FS=OFS="\t"} NR==FNR{a[$1,$2,$3]=$0; next} (($1,$4,$5) in a){print a[$1,$4,$5],$0}' file1 file2

The modified awk script produces:

tig00000005 15310   16162   XP_012153921.1  NW_003797090.1  LOC105664333    PREDICTED: elastin-like tig00000005 maker   gene    15310   16162   .   +   .   ID=snap_masked-tig00000005-processed-gene-0.2;Name=snap_masked-tig00000005-processed-gene-0.2
tig00000005 23339   23974   XP_012152584.1  NW_003797083.1  LOC100878991    PREDICTED: LOW QUALITY PROTEIN tig00000005 maker   gene    23339   23974   .   +   .   ID=snap_masked-tig00000005-processed-gene-0.4;Name=snap_masked-tig00000005-processed-gene-0.4
tig00000005 24600   25138   XP_012143166.1  NW_003797196.1  LOC100881279    PREDICTED: ankyrin-2 isoform X2 tig00000005 maker   gene    24600   25138   .   -   .   ID=snap_masked-tig00000005-processed-gene-0.10;Name=snap_masked-tig00000005-processed-gene-0.10
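
Why the BEGIN block and the changed print statement matter can be checked with two tiny experiments. This is only a sketch: the inputs x, y, z are placeholders rather than data from the question, and the variable names are made up for the demo.

```shell
#!/bin/sh
# 1) A bare assignment outside any block is a pattern with no action. Its value
#    ("\t") is a non-empty string, hence true, so awk runs the default action
#    (print the record) once per assignment: two copies of every input line.
printed=$(printf 'x\ty\n' | awk 'OFS="\t"; FS="\t"' | wc -l | tr -d ' ')
echo "copies printed per input line: $printed"

# 2) a[$1,$2,$3] is stored under the three fields joined by SUBSEP, so indexing
#    the array with the whole line ($0) never finds it; the same comma-separated
#    key has to be used for the lookup.
lookup=$(printf 'x\ty\tz\n' | awk -F'\t' '{
    a[$1,$2,$3] = $0
    by_line = ($0 in a) ? "found" : "missing"
    by_key  = (($1,$2,$3) in a) ? "found" : "missing"
    print by_line, by_key
}')
echo "lookup via \$0 / via (\$1,\$2,\$3): $lookup"
```

The first experiment explains the extra lines in the output of the original script, and the second explains why print $0,a[$0] never appended anything. Setting FS in BEGIN also makes sure it is in effect before the first record is split, since a change to FS normally applies only from the next record on.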