How do you find the most frequent pair of consecutive words in a text using awk?

If your text looks like

Reservoir 1992 reviewed by Reservoir Har even RESERVOIR DOGS

the first thing to do is put every word on its own line, in a single column:

tr -s '[[:punct:][:space:]]' '\n'


Reservoir
1992
reviewed
by
Reservoir
Har
even
RESERVOIR
DOGS

Then you have to join each pair of consecutive lines using

awk 'NR == 1 { prev = $0; next }
             { print prev, $0; prev = $0 }'

Output:

Reservoir 1992
1992 reviewed
reviewed by
by Reservoir
Reservoir Har
Har even
even RESERVOIR
RESERVOIR DOGS

Can printf be used instead of print to get output like this? (See the answer below.)

Reservoir  1992
1992       reviewed
reviewed   by
by         Reservoir
Reservoir  Har
Har        even
even       RESERVOIR
RESERVOIR  DOGS

Then you run sort, then uniq -c, then sort -nr.
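Put together, the steps above might look like the following. This is only a minimal sketch, assuming a POSIX sh with the standard tr, awk, sort and uniq; the function name count_pairs and the sample sentence are made up for illustration.

```shell
#!/bin/sh
# Sketch: list adjacent word pairs with their counts, most frequent first.
# count_pairs is a hypothetical helper name; it reads raw text on stdin.
count_pairs() {
    tr -s '[[:punct:][:space:]]' '\n' |           # one word per line
    awk 'NF == 0 { next }                         # skip an empty line, if any
         !seen   { prev = $0; seen = 1; next }    # remember the first word
                 { print prev, $0; prev = $0 }' | # emit previous and current word
    sort | uniq -c | sort -nr                     # count identical pairs, largest first
}

# Illustrative input with one repeated pair.
printf '%s\n' 'the cat and the cat sat' | count_pairs
```

On this input the first line reports the pair the cat with a count of 2; the remaining pairs each occur once. Note that the comparison is case-sensitive, so Reservoir and RESERVOIR are counted as different words.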

You are close:

awk 'FNR==1{prev=$0; next}
{printf "%s\t%s\n", prev, $0; prev=$0}' file

produces the word pairs in the order you stated.

This:

awk 'FNR==1{prev=$0; next}
{printf "%s\t%s\n", prev, $0; prev=$0}' file | column -t
Reservoir  1992
1992       reviewed
reviewed   by
by         Reservoir
Reservoir  Har
Har        even
even       RESERVOIR
RESERVOIR  DOGS

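produces the aligned output format. Note that the spacing that evens out the column widths is variable. To produce it in awk alone, you generally need to read the file twice: once to find the column width and once to print. The Unix utility column does this for you.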

If you want awk to do all the work, you can do something along these lines:

awk 'FNR==NR{if (length($0)>max) max=length($0); next}
FNR==1{prev=$0; next}
{printf "%-*s\t%s\n", max,prev,$0; prev=$0}' file file
Reservoir   1992
1992        reviewed
reviewed    by
by          Reservoir
Reservoir   Har
Har         even
even        RESERVOIR
RESERVOIR   DOGS
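
If the goal is the count itself rather than the aligned listing, the tally can also stay inside awk by using an associative array keyed on each pair. The following is a rough sketch under that assumption, not the method from the answer above; top_pair is a hypothetical name, and the input is expected to already be one word per line.

```shell
#!/bin/sh
# Sketch: count adjacent word pairs inside awk and print the most frequent one.
# top_pair is a hypothetical helper name; it expects one word per line on stdin.
top_pair() {
    awk 'NR > 1 { count[prev " " $0]++ }   # tally the pair made of previous and current word
                { prev = $0 }              # the current word becomes the previous one
         END {
             for (pair in count)           # scan for the largest tally
                 if (count[pair] > best) { best = count[pair]; top = pair }
             if (best) print best, top     # print nothing when there are no pairs
         }'
}

# Illustrative input: six words, one per line.
printf '%s\n' the cat and the cat sat | top_pair
```

This prints 2 the cat for the illustrative input. When several pairs share the highest count, which one is printed depends on awk's array traversal order, so the sort | uniq -c | sort -nr route is the better choice if you want to see every pair.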