Compare a file with two separate lookup files using awk

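Basically, I want to check whether the strings in lookup_1 and lookup_2 exist in my xyz.txt file, then mask them and redirect the result to an output file. Also, my code currently replaces every occurrence of a string from lookup_1, even when it is only a substring of a longer word, but I need it to replace only on a whole-word match. Could you help adjust the code to achieve this?

Code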

awk '
FNR==NR { if ($0 in lookups)
             next
          lookups[$0]=$0
          for (i=1;i<=NF;i++) {         
              oldstr=$i
              newstr=""
              while (oldstr) {               
                    len=length(oldstr)
                    newstr=newstr substr(oldstr,1,1) substr("##",1,len-1)
                    oldstr=substr(oldstr,4)   
              }
              ndx=index(lookups[$0],$i)
              lookups[$0]=substr(lookups[$0],1,ndx-1) newstr substr(lookups[$0],ndx+length($i))
          }
          next
        }

        { for (i in lookups) { 
              ndx=index($0,i)
              while (ndx > 0) {
                    $0=substr($0,1,ndx-1) lookups[i] substr($0,ndx+length(lookups[i]))
                    ndx=index($0,i)
              }
          }
          print
        }
' lookup_1 xyz.txt > output.txt

lookup_1

ha
achine
skhatw
at
ree
ter
man
dun

lookup_2

United States
CDEXX123X
Institution

xyz.txt

[1] [hamilton] This is a demo file
Demo file is currently being reviewed by user ter
[2] [ter] This is a demo file 
Demo file is currently being edited by user skhatw
Internal Machine's Change Request being processed. Approved by user mandeep
Institution code is 'CDEXX123X' where country is United States

Current output

[1] [h#milton] This is a demo file
Demo file is currently being reviewed by user t##
[2] [t##] This is a demo file 
Demo file is currently being edited by user skh#tw
Internal Ma##i##'s Ch#nge Request being processed. Approved by user m##deep
Institution code is 'CDEXX123X' where country is United States

Expected output

[1] [hamilton] This is a demo file
Demo file is currently being reviewed by user t##
[2] [t##] This is a demo file 
Demo file is currently being edited by user s##a##
Internal Machine's Change Request being processed. Approved by user mandeep
I##t##u##o# code is 'C##X##2##' where country is U##t## S##t##


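We can make a couple of changes to the current code:

  • feed the output of cat lookup_1 lookup_2 into awk so that the two lookup files look like a single file to awk (see the last line of the new code)
  • use the word-boundary operators (\< and \>) to build the regex used for the replacements (see the second part of the new code)

New code: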

awk '
        # the FNR==NR block of code remains the same

FNR==NR { if ($0 in lookups)
             next
          lookups[$0]=$0
          for (i=1;i<=NF;i++) {
              oldstr=$i
              newstr=""
              while (oldstr) {
                    len=length(oldstr)
                    newstr=newstr substr(oldstr,1,1) substr("##",1,len-1)
                    oldstr=substr(oldstr,4)
              }
              ndx=index(lookups[$0],$i)
              lookups[$0]=substr(lookups[$0],1,ndx-1) newstr substr(lookups[$0],ndx+length($i))
          }
          next
        }

        # complete rewrite of the following block to perform replacements based on a regex using word boundaries

        { for (i in lookups) {
              regex= "\\<" i "\\>"          # build regex; backslashes are doubled because the regex is built from string constants
              gsub(regex,lookups[i])          # replace strings that match regex
          }
          print
        }
' <(cat lookup_1 lookup_2) xyz.txt            # combine lookup_1/lookup_2 into a single stream so both files are processed under the FNR==NR block of code

This generates:

[1] [hamilton] This is a demo file
Demo file is currently being reviewed by user t##
[2] [t##] This is a demo file
Demo file is currently being edited by user s##a##
Internal Machine's Change Request being processed. Approved by user mandeep
I##t##u##o# code is 'C##X##2##' where country is U##t## S##t##
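The reason a combined stream works can be checked with a small experiment. The sketch below is only an illustration: the file contents are throwaway samples, and it pipes cat into awk and names standard input as "-" instead of using <(cat ...), which should behave the same way for this purpose while also running in a plain POSIX shell.

```shell
# Sketch: FNR==NR stays true for every line of the combined lookup stream and
# becomes false once awk moves on to the second input (xyz.txt).
# The files created here are throwaway samples, not the real lookup files.
tmp=$(mktemp -d)
printf 'ha\nter\n'            > "$tmp/lookup_1"
printf 'United States\n'      > "$tmp/lookup_2"
printf 'line one\nline two\n' > "$tmp/xyz.txt"

# "-" tells awk to read standard input as its first input "file"
result=$(cat "$tmp/lookup_1" "$tmp/lookup_2" |
  awk 'FNR==NR { lookups++; next }     # lines that came from the combined stream
               { data++ }              # lines that came from xyz.txt
       END     { print lookups, data }' - "$tmp/xyz.txt")

echo "$result"     # prints: 3 2
rm -rf "$tmp"
```

All three lookup lines are counted by the FNR==NR block, even though they came from two files, because awk only sees one input stream before xyz.txt.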

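Notes:

  • the 'boundary' operators (\< and \>) match the empty string at the start and end of a word, i.e. at the transition between word and non-word characters; in awk a word is defined as a sequence of digits, letters and underscores; see GNU awk - regex operators for details
  • all of the sample lookup values fall within awk's definition of a word, so this new code works as expected
  • your previous question included lookup values that cannot be considered an awk 'word' (e.g., @vanti Finserv Co.11:11 - CapitalMS&CO(NY)), in which case this new code may fail to replace those lookup values
  • for lookup values that contain non-word characters it is not clear how you would define a 'whole word match', since you would also need to decide when a non-word character (e.g., @) is to be treated as part of the lookup string rather than as a word boundary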
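One way to sidestep the regex question for such values is to match them as literal strings and check the neighbouring characters by hand. The following is a minimal sketch, not a drop-in replacement: it assumes that a match counts as a 'whole word' when the characters immediately before and after it are not letters, digits or underscores, and the lookup value MS&CO(NY), the mask [MASKED] and the helper is_word_char() are made up for illustration.

```shell
# Sketch: whole-word replacement using literal string matching (index/substr),
# so regex metacharacters in the lookup value need no escaping.
# Assumption: "whole word" means the neighbours are not letters, digits or underscores.
masked=$(printf '%s\n' 'sent by MS&CO(NY) and XMS&CO(NY)' |
  awk -v old='MS&CO(NY)' -v new='[MASKED]' '
    # hypothetical helper: 1 if c is a single word character, 0 otherwise (also 0 for "")
    function is_word_char(c) { return c ~ /^[A-Za-z0-9_]$/ }
    {
        line = $0; out = ""; start = 1
        while ((off = index(substr(line, start), old)) > 0) {
            pos    = start + off - 1                       # absolute position of the match
            before = (pos > 1) ? substr(line, pos - 1, 1) : ""
            after  = substr(line, pos + length(old), 1)    # "" at end of line
            if (is_word_char(before) || is_word_char(after)) {
                # embedded in a longer word: keep the text and resume one character later
                out   = out substr(line, start, pos - start + 1)
                start = pos + 1
            } else {
                # whole-word match: keep the text before it and append the mask
                out   = out substr(line, start, pos - start) new
                start = pos + length(old)
            }
        }
        print out substr(line, start)
    }')

echo "$masked"     # prints: sent by [MASKED] and XMS&CO(NY)
```

Only the standalone occurrence is masked; the one glued to the leading X is left alone. Whether that neighbour rule matches your idea of a whole word is something you would still need to decide.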

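If you need to replace lookup values that contain (awk) non-word characters, you could try replacing the word-boundary operators with \W, though this in turn causes problems for lookup values that are (awk) 'words'.

One possible workaround is to run a dual set of regex matches for each lookup value, e.g.: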

awk '
FNR==NR { ... no changes to this block of code ... }

        { for (i in lookups) {
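              regex= "\\<" i "\\>"          # whole-word match for lookup values that are (awk) words
              gsub(regex,lookups[i])
              regex= "\\W" i "\\W"          # lookup value surrounded by non-word characters; gsub() also replaces those two surrounding characters
              gsub(regex,lookups[i])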
          }
          print
        }
' <(cat lookup_1 lookup_2) xyz.txt

You will need to determine whether the second regex meets your 'whole word match' requirement.
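One thing to check when evaluating the second regex is that the surrounding non-word characters are part of the match, so a plain gsub() replaces them as well. The sketch below compares that with a match()-based loop that puts the neighbours back. It is an illustration only: [^A-Za-z0-9_] is used in place of \W so that it does not depend on GNU awk, and the sample line and values are made up.

```shell
# Sketch: plain gsub() on "non-word char + value + non-word char" versus a
# match() loop that keeps the neighbouring characters.
# [^A-Za-z0-9_] stands in for \W here (an assumption made for portability).
demo=$(printf '%s\n' 'user ter, then ter again' |
  awk -v old='ter' -v new='t##' '
    {
        pat = "[^A-Za-z0-9_]" old "[^A-Za-z0-9_]"

        # 1) plain gsub: the neighbours belong to the match, so they are replaced too
        plain = $0
        gsub(pat, new, plain)
        print "gsub : " plain

        # 2) match loop: copy the leading neighbour, insert the mask, and leave the
        #    trailing neighbour in place so it can also start the next match
        out = ""; rest = $0
        while (match(rest, pat)) {
            out  = out substr(rest, 1, RSTART) new
            rest = substr(rest, RSTART + RLENGTH - 1)
        }
        print "match: " out rest
    }')

echo "$demo"
# prints:
# gsub : usert## thent##again
# match: user t##, then t## again
```

Note that neither variant matches a value sitting at the very start or end of a line, since a neighbouring character is required on both sides, and the lookup value is still interpreted as a regex here.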