awk

Question

I am trying to highlight each occurrence of a matching word with the awk command below.

Input:

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Command:

python -c 'import this' | awk -v t="better" ' { gsub(t,"[" "&-" ++a["&"] "]" ,$0); print } '

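But gsub() doesn't seem to work correctly. The input contains 8 occurrences of "better" in total, but for the last one the command above prints "better-18". How can I fix this?

Although never is often [better-18] than *right* now.  # wrong, should be 8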

Expected output:

Beautiful is [better-1] than ugly.
Explicit is [better-2] than implicit.
Simple is [better-3] than complex.
Complex is [better-4] than complicated.
Flat is [better-5] than nested.
Sparse is [better-6] than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is [better-7] than never.
Although never is often [better-8] than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

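I would like a scalable solution that can handle more words, i.e. words read from another text file, while scanning the input file only once. A "word" should be an exact match, not a partial/substring match.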

match.txt

better
idea

Answer 1

You can't do this:

gsub(t,"[" "&-" ++a["&"] "]" ,$0)

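because it does a partial regexp match rather than a full-string match and, more importantly, the arguments to gsub() are evaluated before gsub() is called, so ++a["&"] increments the value of a[] indexed by the literal 1-character string "&" before gsub() is called. That increment happens once per input line, whether or not the line matches, which is why the last "better" (on the 18th line of input) came out as 18. It's exactly as if you had written: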

foo=(++a["&"]); gsub(t,"[" "&-" foo "]" ,$0)
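
To see that evaluation order in isolation, here is a small sketch. The three input lines are made up for the demonstration (they are not the poster's file), and the parentheses around the increment are only there for readability. The counter goes up once per line, and every match on the same line gets the same number:

```shell
# Made-up three-line input: no match, two matches, one match.
out=$(printf '%s\n' 'no match here' 'a better b better' 'better' |
    awk -v t="better" '{
        # The replacement string is built once, BEFORE gsub() runs,
        # so a["&"] is bumped once per line, not once per match.
        gsub(t, "[&-" (++a["&"]) "]", $0)
        print
    }')
printf '%s\n' "$out"
```

The first line is printed unchanged even though it still consumed number 1, the second line comes out as "a [better-2] b [better-2]", and the third as "[better-3]".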

This is probably what you're trying to do, using GNU awk for patsplit():

$ cat tst.awk
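NR==FNR {
    # 1st file (match.txt): store each word as an array index
    words[$0]
    next
}
{
    # 2nd file: split the line into words (flds) and the text between them (seps)
    n = patsplit($0,flds,/[[:alnum:]_]+/,seps)
    out = seps[0]
    for (i=1; i<=n; i++) {
        word = flds[i]
        if ( word in words ) {
             # whole-word string comparison, one counter per word
             word = "[" word "-" (++cnt[word]) "]"
        }
        out = out word seps[i]
    }
    print out
}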

$ awk -f tst.awk match.txt file
The Zen of Python, by Tim Peters

Beautiful is [better-1] than ugly.
Explicit is [better-2] than implicit.
Simple is [better-3] than complex.
Complex is [better-4] than complicated.
Flat is [better-5] than nested.
Sparse is [better-6] than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is [better-7] than never.
Although never is often [better-8] than *right* now.
If the implementation is hard to explain, it's a bad [idea-1].
If the implementation is easy to explain, it may be a good [idea-2].
Namespaces are one honking great [idea-3] -- let's do more of those!

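The above assumes your definition of a "word" is any sequence of alphanumeric or underscore (i.e. "word constituent") characters. If it isn't, just change [[:alnum:]_]+ to whatever matches your definition of a "word".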

You can do the same in any POSIX awk with a while (match(..,/[[:alnum:]_]+/)) substr(... loop; that's left as an exercise if you care to try it.
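
For reference, one possible shape of that loop is sketched below. It is a sketch, not a drop-in replacement for the patsplit() script: the scratch files and their contents are made up for the demonstration, the variable names rest and out are my own, and it assumes an awk that supports POSIX character classes such as [[:alnum:]].

```shell
# Hypothetical scratch files, created only for this demonstration.
dir=$(mktemp -d)
printf '%s\n' better idea > "$dir/match.txt"
printf '%s\n' 'Now is better than never.' \
              'A bad idea, a better idea.' \
              'betterment is not better' > "$dir/file"

out=$(awk '
NR==FNR {                      # 1st file: remember each word to highlight
    words[$0]
    next
}
{
    rest = $0                  # part of the line not yet processed
    out  = ""
    while ( match(rest, /[[:alnum:]_]+/) ) {
        word = substr(rest, RSTART, RLENGTH)
        out  = out substr(rest, 1, RSTART-1)     # text before the word
        if ( word in words ) {
            word = "[" word "-" (++cnt[word]) "]"
        }
        out  = out word
        rest = substr(rest, RSTART+RLENGTH)      # continue after the word
    }
    print out rest             # whatever follows the last word
}
' "$dir/match.txt" "$dir/file")
printf '%s\n' "$out"
rm -rf "$dir"
```

With that made-up input it numbers "better" 1 to 3 and "idea" 1 to 2 across the lines, and leaves "betterment" alone because the comparison is on the whole word.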

If you want to handle 1 word passed in as a variable assignment, then:

$ cat tst.awk
{
    n = patsplit($0,flds,/[[:alnum:]_]+/,seps)
    out = seps[0]
    for (i=1; i<=n; i++) {
        word = flds[i]
        if ( word == t ) {
             word = "[" word "-" (++cnt) "]"
        }
        out = out word seps[i]
    }
    print out
}

$ awk -v t='better' -f tst.awk file
The Zen of Python, by Tim Peters

Beautiful is [better-1] than ugly.
Explicit is [better-2] than implicit.
Simple is [better-3] than complex.
Complex is [better-4] than complicated.
Flat is [better-5] than nested.
Sparse is [better-6] than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is [better-7] than never.
Although never is often [better-8] than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Answer 2

For multiple match words, you can also use Perl:

$ perl -pe ' BEGIN { @m= map {chomp;$_} qx(cat match.txt); $t=join("|",@m) };
s/\b(?:$t)\b/$kv{$&}++;"[$&-$kv{$&}]"/ge ' input.txt
The Zen of Python, by Tim Peters

Beautiful is [better-1] than ugly.
Explicit is [better-2] than implicit.
Simple is [better-3] than complex.
Complex is [better-4] than complicated.
Flat is [better-5] than nested.
Sparse is [better-6] than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is [better-7] than never.
Although never is often [better-8] than *right* now.
If the implementation is hard to explain, it's a bad [idea-1].
If the implementation is easy to explain, it may be a good [idea-2].
Namespaces are one honking great [idea-3] -- let's do more of those!

awk - numbering the match occurrence is not working correctly

unix