Compare columns in two files and if match change string in another column

I have two files:

file1 
non-coding  X   FlyBase gene    20025099    20025170    .   +   .   gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding  X   FlyBase gene    19910168    19910521    .   -   .   gene_id "FBgn0052821"; gene_symbol "CR32821";
non-coding  X   FlyBase gene    476857  479309  .   -   .   gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding  X   FlyBase gene    15576355    15576964    .   +   .   gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
non-coding  X   FlyBase gene    19910168    19910521    .   -   .   gene_id "FBgn0052821"; gene_symbol "CR32821";

file2
betaNACtes5
CR18275
28SrRNA-Psi:CR45859
CR32821

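What I want: if any line in file2 matches the 13th column of file1 (a partial match, because of the surrounding ""), change the string in the 4th column to "pseudogene"; otherwise the line should be left unchanged.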

Desired output

non-coding  X   FlyBase gene    20025099    20025170    .   +   .   gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding  X   FlyBase pseudogene  19910168    19910521    .   -   .   gene_id "FBgn0052821"; gene_symbol "CR32821";
non-coding  X   FlyBase pseudogene  476857  479309  .   -   .   gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding  X   FlyBase pseudogene  15576355    15576964    .   +   .   gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
non-coding  X   FlyBase pseudogene  19910168    19910521    .   -   .   gene_id "FBgn0052821"; gene_symbol "CR32821";

So far I can get the matches, but I can't do the rest.

grep -Ff file2 file1

With your shown samples, please try the following awk code. It also preserves the spacing present in file1.

awk '
BEGIN{ s1="\"" }
FNR==NR{
  arr[s1 $0 s1";"]
  next
}
{
  match($0,/^([^[:space:]]+[[:space:]]+){3}/)
  firstPart=substr($0,RSTART,RLENGTH)
  $0=substr($0,RSTART+RLENGTH)
  match($0,/^[^ ]+/)
  restPart=substr($0,RSTART+RLENGTH)
  print firstPart ($NF in arr?"pseudogene":substr($0,RSTART,RLENGTH)) restPart
}
' file2 file1
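
The spacing caveat matters because the obvious shortcut, assigning to `$4`, makes awk rebuild the record with the default OFS (a single space). The following is only a minimal sketch of that difference on one row from file1, assuming space- or tab-separated columns as in the question. The variable names `row`, `naive` and `kept` are illustrative, and the field pattern is written out three times instead of using `{3}` because some older awk builds lack interval expressions.

```shell
#!/bin/sh
# One sample row from the question (runs of spaces between columns).
row='non-coding  X   FlyBase gene    15576355    15576964    .   +   .   gene_id "FBgn0262163"; gene_symbol "betaNACtes5";'

# Naive: assigning to a field makes awk rebuild $0 with OFS (one space).
naive=$(printf '%s\n' "$row" | awk '{ $4 = "pseudogene"; print }')

# Preserving: cut the line after field 3, then swap only field 4.
kept=$(printf '%s\n' "$row" | awk '{
  match($0, /^[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+[ \t]+/)  # fields 1-3 plus separators
  head = substr($0, 1, RLENGTH)
  tail = substr($0, RLENGTH + 1)
  sub(/^[^ \t]+/, "pseudogene", tail)                     # replace field 4 only
  print head tail
}')

printf 'naive: %s\nkept:  %s\n' "$naive" "$kept"
```

The `naive` line comes out with every run of spaces collapsed to one, while `kept` differs from the input only in the 4th column.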

Explanation: a detailed, line-by-line explanation of the above.

awk '                                          ##Start of the awk program.
BEGIN{ s1="\"" }                               ##Set s1 to a double quote in the BEGIN section.
FNR==NR{                                       ##FNR==NR is TRUE only while file2 (the first file) is being read.
  arr[s1 $0 s1";"]                             ##Create an arr entry indexed by s1, the current line, s1 and a semicolon, e.g. "CR18275";
  next                                         ##Skip all further statements for this line.
}
{
  match($0,/^([^[:space:]]+[[:space:]]+){3}/)  ##Use match to find the first 3 fields and the whitespace after them.
  firstPart=substr($0,RSTART,RLENGTH)          ##Save the matched part in firstPart to be used later on.
  $0=substr($0,RSTART+RLENGTH)                 ##Set the current line to everything after the matched part.
  match($0,/^[^ ]+/)                           ##Match from the start up to the 1st space of the current line, i.e. the 4th field.
  restPart=substr($0,RSTART+RLENGTH)           ##restPart holds everything after the 4th field.
  print firstPart ($NF in arr?"pseudogene":substr($0,RSTART,RLENGTH)) restPart ##Print firstPart, then pseudogene OR the original 4th field, then restPart.
}
' file2 file1                                  ##Input file names.
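
The lookup works because each file2 line is wrapped in double quotes plus a semicolon, so it becomes byte-for-byte equal to the last field of a matching file1 line. As a rough illustration, the sketch below prints both sides of that comparison for two rows taken from the question. The temporary directory and the `report` variable are assumptions made for the demo, not part of the answer.

```shell
#!/bin/sh
tmp=$(mktemp -d)

# Two symbols from file2 and two rows from file1, as posted in the question.
printf '%s\n' betaNACtes5 CR18275 > "$tmp/file2"
printf '%s\n' \
  'non-coding  X   FlyBase gene    20025099    20025170    .   +   .   gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";' \
  'non-coding  X   FlyBase gene    15576355    15576964    .   +   .   gene_id "FBgn0262163"; gene_symbol "betaNACtes5";' \
  > "$tmp/file1"

report=$(awk '
# First file: build the key exactly as it appears in the last column.
FNR == NR { key = "\"" $0 "\";"; arr[key]; print "key:  " key; next }
# Second file: show the last field and whether it is a known key.
{ print "last: " $NF, (($NF in arr) ? "-> match" : "-> no match") }
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$report"
rm -rf "$tmp"
```

Only the betaNACtes5 row is reported as a match; the tRNA row has no corresponding key and would be printed unchanged by the answer's code.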

Using GNU awk for the 3rd argument to match() and the \s/\S shorthand:

$ cat tst.awk
NR==FNR {
    genes["\"" $0 "\";"]
    next
}
$NF in genes {
    match($0,/((\S+\s+){3})\S+(.*)/,a)
    $0 = a[1] "pseudogene" a[3]
}
{ print }

$ awk -f tst.awk file2 file1
non-coding  X   FlyBase gene    20025099    20025170    .   +   .   gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding  X   FlyBase pseudogene    19910168    19910521    .   -   .   gene_id "FBgn0052821"; gene_symbol "CR32821";
non-coding  X   FlyBase pseudogene    476857  479309  .   -   .   gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding  X   FlyBase pseudogene    15576355    15576964    .   +   .   gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
non-coding  X   FlyBase pseudogene    19910168    19910521    .   -   .   gene_id "FBgn0052821"; gene_symbol "CR32821";

Or, using any POSIX awk:

$ cat tst.awk
NR==FNR {
    genes["\"" $0 "\";"]
    next
}
$NF in genes {
    match($0,/([^[:space:]]+[[:space:]]+){3}/)
    tail = substr($0,RLENGTH+1)
    sub(/[^[:space:]]+/,"",tail)
    $0 = substr($0,1,RLENGTH) "pseudogene" tail
}
{ print }

$ awk -f tst.awk file2 file1
non-coding  X   FlyBase gene    20025099    20025170    .   +   .   gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding  X   FlyBase pseudogene    19910168    19910521    .   -   .   gene_id "FBgn0052821"; gene_symbol "CR32821";
non-coding  X   FlyBase pseudogene    476857  479309  .   -   .   gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding  X   FlyBase pseudogene    15576355    15576964    .   +   .   gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
non-coding  X   FlyBase pseudogene    19910168    19910521    .   -   .   gene_id "FBgn0052821"; gene_symbol "CR32821";
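
For repeated use, the POSIX variant can be wrapped in a small shell function. This is only a sketch under stated assumptions: the name `mark_pseudogenes` and the temporary files are hypothetical, the columns are assumed to be separated by spaces or tabs, and the three leading fields are matched with the pattern written out explicitly rather than with `{3}`, since some older awk builds do not support interval expressions.

```shell
#!/bin/sh
# mark_pseudogenes LIST ANNOTATIONS
# Hypothetical helper: prints ANNOTATIONS with column 4 set to "pseudogene"
# on every line whose last field is a quoted symbol listed in LIST.
mark_pseudogenes() {
  awk '
  NR == FNR { genes["\"" $0 "\";"]; next }   # keys look like "CR18275";
  $NF in genes {
    match($0, /^[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+[ \t]+/)  # fields 1-3 plus separators
    tail = substr($0, RLENGTH + 1)
    sub(/^[^ \t]+/, "", tail)                               # drop the old field 4
    $0 = substr($0, 1, RLENGTH) "pseudogene" tail
  }
  { print }
  ' "$1" "$2"
}

# Demo on two rows from the question: one unlisted symbol, one listed.
tmp=$(mktemp -d)
printf '%s\n' betaNACtes5 CR18275 > "$tmp/file2"
printf '%s\n' \
  'non-coding  X   FlyBase gene    20025099    20025170    .   +   .   gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";' \
  'non-coding  X   FlyBase gene    15576355    15576964    .   +   .   gene_id "FBgn0262163"; gene_symbol "betaNACtes5";' \
  > "$tmp/file1"

result=$(mark_pseudogenes "$tmp/file2" "$tmp/file1")
printf '%s\n' "$result"
rm -rf "$tmp"
```

The unlisted tRNA row passes through untouched, and the betaNACtes5 row changes only in its 4th column. Note that the `NR == FNR` test assumes the list file is not empty.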