Compare columns in two files and, if they match, change the string in another column
I have two files:
file1
non-coding X FlyBase gene 20025099 20025170 . + . gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding X FlyBase gene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";
non-coding X FlyBase gene 476857 479309 . - . gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding X FlyBase gene 15576355 15576964 . + . gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
non-coding X FlyBase gene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";
file2
betaNACtes5
CR18275
28SrRNA-Psi:CR45859
CR32821
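What I want: if any line in file2 matches column 13 of file1 (a partial match, because of the ""), I want to change the string in column 4 to "pseudogene"; otherwise nothing should be done.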
Desired output
non-coding X FlyBase gene 20025099 20025170 . + . gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding X FlyBase pseudogene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";
non-coding X FlyBase gene 476857 479309 . - . gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding X FlyBase pseudogene 15576355 15576964 . + . gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
non-coding X FlyBase pseudogene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";
So far I can get the matches, but I can't do the rest.
grep -Ff file2 file1
With your shown samples, please try the following awk code. It also preserves the spacing present in file1.
awk '
BEGIN{ s1="\"" }
FNR==NR{
  arr[s1 $0 s1";"]
  next
}
{
  match($0,/^([^[:space:]]+[[:space:]]+){3}/)
  firstPart=substr($0,RSTART,RLENGTH)
  $0=substr($0,RSTART+RLENGTH)
  match($0,/^[^ ]+/)
  restPart=substr($0,RSTART+RLENGTH)
  print firstPart ($NF in arr?"pseudogene":substr($0,RSTART,RLENGTH)) restPart
}
' file2 file1
Explanation: a detailed, line-by-line explanation of the above.
awk '                                  ##Start the awk program.
BEGIN{ s1="\"" }                       ##Set s1 to a double quote in the BEGIN section.
FNR==NR{                               ##FNR==NR is TRUE only while file2 (the first file) is being read.
  arr[s1 $0 s1";"]                     ##Create an index in arr of the form "line"; (s1, current line, s1, semicolon).
  next                                 ##Skip all further statements for this line.
}
{
  match($0,/^([^[:space:]]+[[:space:]]+){3}/)   ##Match the first 3 fields together with their trailing whitespace.
  firstPart=substr($0,RSTART,RLENGTH)  ##Save the matched part into firstPart for later use.
  $0=substr($0,RSTART+RLENGTH)         ##Set the current line to everything after the matched part.
  match($0,/^[^ ]+/)                   ##Match from the start up to the 1st space, which is the original 4th field.
  restPart=substr($0,RSTART+RLENGTH)   ##restPart holds everything after the 4th field.
  print firstPart ($NF in arr?"pseudogene":substr($0,RSTART,RLENGTH)) restPart   ##Print firstPart, then pseudogene OR the original 4th field, then restPart.
}
' file2 file1                          ##Input file names: file2 first, then file1.
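If the original spacing between fields does not need to be preserved, a much shorter variant is possible. The sketch below is not part of the answer above; it assumes that collapsing runs of whitespace to single spaces is acceptable, because assigning to $4 makes awk rebuild the line using OFS. The scratch directory and the two-line sample (taken from the question) are there only to make the snippet self-contained.

```shell
#!/bin/sh
# Recreate a small sample in a scratch directory (data copied from the question).
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
non-coding X FlyBase gene 20025099 20025170 . + . gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding X FlyBase gene 15576355 15576964 . + . gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
EOF
printf '%s\n' betaNACtes5 CR18275 > "$tmp/file2"

# Pass 1 (file2): store each symbol as "symbol"; so it equals the last field of file1.
# Pass 2 (file1): overwrite field 4 on matching lines; the trailing 1 prints every line.
out=$(awk '
NR==FNR { genes["\"" $0 "\";"]; next }
$NF in genes { $4 = "pseudogene" }
1
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$out"
rm -rf "$tmp"
```

The match()/substr() approach in the answer is the one to use when the exact whitespace of file1 matters.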
Using GNU awk for the 3rd arg to match() and the \s/\S shorthand:
$ cat tst.awk
NR==FNR {
    genes["\"" $0 "\";"]
    next
}
$NF in genes {
    match($0,/((\S+\s+){3})\S+(.*)/,a)
    $0 = a[1] "pseudogene" a[3]
}
{ print }
$ awk -f tst.awk file2 file1
non-coding X FlyBase gene 20025099 20025170 . + . gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding X FlyBase gene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";
non-coding X FlyBase pseudogene 476857 479309 . - . gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding X FlyBase pseudogene 15576355 15576964 . + . gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
Or, using any POSIX awk:
$ cat tst.awk
NR==FNR {
    genes["\"" $0 "\";"]
    next
}
$NF in genes {
    match($0,/([^[:space:]]+[[:space:]]+){3}/)
    tail = substr($0,RLENGTH+1)
    sub(/[^[:space:]]+/,"",tail)
    $0 = substr($0,1,RLENGTH) "pseudogene" tail
}
{ print }
$ awk -f tst.awk file2 file1
non-coding X FlyBase gene 20025099 20025170 . + . gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding X FlyBase gene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";
non-coding X FlyBase pseudogene 476857 479309 . - . gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding X FlyBase pseudogene 15576355 15576964 . + . gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
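Both answers build the lookup key by wrapping each file2 line in quotes plus a semicolon. As an alternative sketch (not taken from either answer), the quotes and semicolon can instead be stripped from the last field of file1 before the lookup. The variable names sym, head and tail are illustrative, and the three-field prefix is written out in full rather than with {3}, on the assumption that some older awk builds do not support interval expressions. The first sample line has extra spaces added to show that spacing is kept.

```shell
#!/bin/sh
# Scratch inputs; the first line has irregular spacing on purpose.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
non-coding  X FlyBase   gene 476857 479309 . - . gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding X FlyBase gene 20025099 20025170 . + . gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
EOF
printf '%s\n' betaNACtes5 CR18275 > "$tmp/file2"

out=$(awk '
NR==FNR { genes[$0]; next }            # file2: store the bare symbols
{
    sym = $NF
    gsub(/^"|";$/, "", sym)            # "CR18275"; becomes CR18275
    if (sym in genes) {
        # head = first three fields plus the whitespace that follows each of them
        match($0, /^[^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]+/)
        head = substr($0, 1, RLENGTH)
        tail = substr($0, RLENGTH + 1)
        sub(/^[^[:space:]]+/, "", tail)  # drop the old 4th field, keep what follows
        $0 = head "pseudogene" tail
    }
    print
}
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$out"
rm -rf "$tmp"
```

Stripping the punctuation keeps the keys in file2 untouched, which may be convenient if the same list is reused against files with a different quoting style.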