awk match three columns from two files and append matching lines to a new file
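There are many posts similar to this one. I have spent hours trying to solve this and I am getting desperate, because it seems like it should be simple.
I have a file like this: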
tig00000005 15310 16162 XP_012153921.1 NW_003797090.1 LOC105664333 PREDICTED: elastin-like
tig00000005 23339 23974 XP_012152584.1 NW_003797083.1 LOC100878991 PREDICTED: LOW QUALITY PROTEIN
tig00000005 24600 25138 XP_012143166.1 NW_003797196.1 LOC100881279 PREDICTED: ankyrin-2 isoform X2
tig00000005 2685 4511 XP_012144644.1 NW_003797249.1 LOC105662970 PREDICTED: fibrinogen alpha chain-like isoform X2
tig00000005 28923 29432 XP_012148395.1 NW_003797444.1 LOC100881617 PREDICTED: eukaryotic translation initiation factor 4 gamma 3-like isoform X12
tig00000005 32415 34324 XP_012153921.1 NW_003797090.1 LOC105664333 PREDICTED: elastin-like
The second file looks like this:
tig00000005 maker gene 15310 16162 . + . ID=snap_masked-tig00000005-processed-gene-0.2;Name=snap_masked-tig00000005-processed-gene-0.2
tig00000005 maker gene 16764 17237 . + . ID=snap_masked-tig00000005-processed-gene-0.3;Name=snap_masked-tig00000005-processed-gene-0.3
tig00000005 maker gene 23339 23974 . + . ID=snap_masked-tig00000005-processed-gene-0.4;Name=snap_masked-tig00000005-processed-gene-0.4
tig00000005 maker gene 24600 25138 . - . ID=snap_masked-tig00000005-processed-gene-0.10;Name=snap_masked-tig00000005-processed-gene-0.10
tig00000005 maker gene 25472 26900 . + . ID=snap_masked-tig00000005-processed-gene-0.5;Name=snap_masked-tig00000005-processed-gene-0.5
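I want to match columns 1, 2 and 3 of the first file against columns 1, 4 and 5 of the second file and, where they match, append the line from the second file to the line from the first file, like so: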
tig00000005 15310 16162 XP_012153921.1 NW_003797090.1 LOC105664333 PREDICTED: elastin-like tig00000005 maker gene 15310 16162 . + . ID=snap_masked-tig00000005-processed-gene-0.2;Name=snap_masked-tig00000005-processed-gene-0.2
Some example code that did not work:
awk 'OFS="\t"; FS="\t"; NR==FNR{a[$1,$2,$3]=$0; next} (($1,$4,$5) in a){print $0,a[$0]}' file 1 file 2
awk 'OFS="\t"; FS="\t"; NR==FNR{a[$1,$2,$3]=($1,$4,$5)} {print $0,a[$0]}' file 1 file 2
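The first outputs a file with every line of file 1 followed by (not appended to) file 2; the second throws an error related to the = sign. I have tried every permutation I can think of. Thanks for any help you can offer.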
Like this?
awk 'NR==FNR{a[$1" "$2" "$3]=$0; next}; {if(($1" "$4" "$5) in a){print a[$1" "$4" "$5],$0}}' file1 file2
tig00000005 15310 16162 XP_012153921.1 NW_003797090.1 LOC105664333 PREDICTED: elastin-like tig00000005 maker gene 15310 16162 . + . ID=snap_masked-tig00000005-processed-gene-0.2;Name=snap_masked-tig00000005-processed-gene-0.2
tig00000005 23339 23974 XP_012152584.1 NW_003797083.1 LOC100878991 PREDICTED: LOW QUALITY PROTEIN tig00000005 maker gene 23339 23974 . + . ID=snap_masked-tig00000005-processed-gene-0.4;Name=snap_masked-tig00000005-processed-gene-0.4
tig00000005 24600 25138 XP_012143166.1 NW_003797196.1 LOC100881279 PREDICTED: ankyrin-2 isoform X2 tig00000005 maker gene 24600 25138 . - . ID=snap_masked-tig00000005-processed-gene-0.10;Name=snap_masked-tig00000005-processed-gene-0.10
To write to a new file, just run awk 'NR==FNR{a[$1" "$2" "$3]=$0; next}; {if(($1" "$4" "$5) in a){print a[$1" "$4" "$5],$0}}' file1 file2 > file3
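For anyone who wants to try this end to end, here is a minimal sketch of the same lookup run against a few rows copied from the question. The scratch directory and the `workdir` and `key` names are assumptions made for illustration, and the data is assumed to be whitespace-separated as shown in the post.

```shell
# Work in a scratch directory so no existing files are touched (assumption).
workdir=$(mktemp -d)

# Two rows of the first file; columns 1-3 are contig, start, end.
cat > "$workdir/file1" <<'EOF'
tig00000005 15310 16162 XP_012153921.1 NW_003797090.1 LOC105664333 PREDICTED: elastin-like
tig00000005 2685 4511 XP_012144644.1 NW_003797249.1 LOC105662970 PREDICTED: fibrinogen alpha chain-like isoform X2
EOF

# Two rows of the second file; columns 1, 4, 5 are contig, start, end.
cat > "$workdir/file2" <<'EOF'
tig00000005 maker gene 15310 16162 . + . ID=snap_masked-tig00000005-processed-gene-0.2;Name=snap_masked-tig00000005-processed-gene-0.2
tig00000005 maker gene 16764 17237 . + . ID=snap_masked-tig00000005-processed-gene-0.3;Name=snap_masked-tig00000005-processed-gene-0.3
EOF

awk '
    # First file: remember each whole line under its three-column key.
    NR == FNR { a[$1 " " $2 " " $3] = $0; next }
    # Second file: build the key from columns 1, 4 and 5.
    { key = $1 " " $4 " " $5 }
    # Print the stored file1 line followed by the current file2 line.
    key in a { print a[key], $0 }
' "$workdir/file1" "$workdir/file2" > "$workdir/file3"

cat "$workdir/file3"
```

Only the 15310-16162 row exists in both inputs, so that is the only line written to file3. Note that if two lines of the first file share the same three columns, the later one overwrites the earlier one in the array.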
A few small changes to OP's first awk script:
# old:
awk 'OFS="\t"; FS="\t"; NR==FNR{a[$1,$2,$3]=$0; next} (($1,$4,$5) in a){print $0,a[$0]}' file1 file2
# new - add BEGIN block, modify print statement:
awk 'BEGIN {FS=OFS="\t"} NR==FNR{a[$1,$2,$3]=$0; next} (($1,$4,$5) in a){print a[$1,$4,$5],$0}' file1 file2
The modified awk script generates:
tig00000005 15310 16162 XP_012153921.1 NW_003797090.1 LOC105664333 PREDICTED: elastin-like tig00000005 maker gene 15310 16162 . + . ID=snap_masked-tig00000005-processed-gene-0.2;Name=snap_masked-tig00000005-processed-gene-0.2
tig00000005 23339 23974 XP_012152584.1 NW_003797083.1 LOC100878991 PREDICTED: LOW QUALITY PROTEIN tig00000005 maker gene 23339 23974 . + . ID=snap_masked-tig00000005-processed-gene-0.4;Name=snap_masked-tig00000005-processed-gene-0.4
tig00000005 24600 25138 XP_012143166.1 NW_003797196.1 LOC100881279 PREDICTED: ankyrin-2 isoform X2 tig00000005 maker gene 24600 25138 . - . ID=snap_masked-tig00000005-processed-gene-0.10;Name=snap_masked-tig00000005-processed-gene-0.10
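As a rough illustration of why the BEGIN block matters: outside of braces, an assignment such as OFS="\t" is treated as a pattern, and since its value is a non-empty string it is true for every record, so awk applies the default action and prints the line. That would explain why the original script echoed every input line. The sketch below uses a made-up two-line input (`in.tsv`) and a `demo` scratch directory, both of which are assumptions for the demo only.

```shell
demo=$(mktemp -d)
printf 'a\tb\nc\td\n' > "$demo/in.tsv"

# Assignment used as a pattern: true for every record, so awk also prints $0
# in addition to what the explicit action prints.
awk 'OFS="-"; { print $1, $2 }' "$demo/in.tsv" > "$demo/bare.out"

# Assignment inside BEGIN: runs once before any input and prints nothing itself.
awk 'BEGIN { OFS="-" } { print $1, $2 }' "$demo/in.tsv" > "$demo/begin.out"

echo "without BEGIN: $(wc -l < "$demo/bare.out" | tr -d ' ') lines"
echo "with BEGIN:    $(wc -l < "$demo/begin.out" | tr -d ' ') lines"
```

The version without BEGIN writes twice as many lines as there are input records, because each record is printed once by the bare assignment and once by the action. The other change in the answer, printing a[$1,$4,$5] instead of a[$0], is needed because the array was keyed on three columns, so looking it up by the whole line finds nothing.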