How can I iterate through text two lines at a time?

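I have a text file, and I want to go through it and list a count for each time two words appear consecutively. For example, my desired output looks like this: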

Sample input:

I am a man

Desired output:

1 I am
1 am a
1 a man

This is what I was thinking:

cat  | sed "s/ /\n/g" | read  word1 && 
    while read word2;
    do
        echo "$word1    $word2";
        word1=word2;
    done

This goes into an infinite loop, though. Any help is appreciated!

Call read twice in the while condition:

while read line1; read line2; do
    echo "$line1 $line2"
done <<EOF
1
a
2
b
EOF

This outputs

1 a
2 b

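The loop exits when the second read fails, even if the first one succeeded. If you want the loop body to run anyway (even if line2 is empty), move read line2 into the loop body.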
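
As a minimal sketch of that last variant, the function below moves the second read into the loop body, so a trailing unpaired line is still printed. The name pair_lines is hypothetical and used only for illustration; the `|| true` is an added guard so that a failed second read does not abort a script running under set -e.

```shell
# pair_lines (hypothetical name): read stdin two lines at a time.
pair_lines() {
    while IFS= read -r line1; do
        # At end of input this read fails and leaves line2 empty,
        # but the loop body still runs for the unpaired line1.
        IFS= read -r line2 || true
        printf '%s %s\n' "$line1" "$line2"
    done
}

# Three input lines: the last one has no partner.
printf '1\na\n2\n' | pair_lines
```

With the three-line input above, this prints "1 a" and then "2 " (the second line ends with a space because line2 is empty).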

With bash:

set -f         # for slurping in the words of the file, we want word splitting
               # but not glob expansion
words=( $(< "") )

for ((i = 1; i < ${#words[@]}; i++)); do
  printf "%s %s\n" "${words[i-1]}" "${words[i]}"
done

Given @chepner's input file, this outputs

1 a
a 2
2 b

Rewriting your code: you need a grouping construct so that all the read calls read from the same pipeline of data.

tr -s '[:space:]' '\n' < "" | {
  IFS= read -r word1
  while IFS= read -r word2; do 
    echo "$word1 $word2"
    word1=$word2
  done
}

For counting, the simplest way is to pipe the output into | sort | uniq -c

Using the words.dat file from @markp-fuso, the output of both solutions is

      3 I am
      3 a man
      2 am a
      1 am not
      2 man I
      1 not a
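
The counting pipeline described above can be sketched end to end as follows. This is a hedged illustration, not the answer's exact command: the function name count_word_pairs is hypothetical, the text is read from stdin rather than from a named file, and LC_ALL=C is an added assumption so that sort uses byte order (uppercase before lowercase). The width of the count column depends on the uniq implementation.

```shell
# count_word_pairs (hypothetical name): read text on stdin and print
# each pair of adjacent words preceded by the number of occurrences.
count_word_pairs() {
    tr -s '[:space:]' '\n' | {
        # The braces make both read calls consume the same pipe.
        IFS= read -r word1
        while IFS= read -r word2; do
            printf '%s %s\n' "$word1" "$word2"
            word1=$word2
        done
    } | LC_ALL=C sort | uniq -c
}

# Same three lines as the words.dat sample used in this thread.
count_word_pairs <<'EOF'
I am a man
I am not a man I
am a man
EOF
```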

The counting can also be done in bash with an associative array:

declare -A pairs

for ((i = 1; i < ${#words[@]}; i++)); do
  key="${words[i-1]} ${words[i]}"
  pairs[$key]=$(( pairs[$key] + 1 ))
done

for key in "${!pairs[@]}"; do
  printf "%7d %s\n" "${pairs[$key]}" "$key"
done

This outputs (associative array keys come back in no particular order):

      1 not a
      3 a man
      1 am not
      2 am a
      3 I am
      2 man I

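Assumptions:

  • counts accumulate across the whole file (as opposed to restarting the count for each new line)
  • word pairs can span lines, e.g., one\nword is equivalent to one word
  • we are only interested in 2-word pairings, i.e., there is no need to code for a dynamic number of words (e.g., 3-word, 4-word)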
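
To make the second assumption concrete, here is a hedged sketch of the opposite behavior, where pairs are not allowed to span lines. It relies on awk's default record separator (one record per input line). The function name pairs_within_lines and the trailing sort are illustrative additions, not part of the original answer.

```shell
# pairs_within_lines (hypothetical name): count adjacent word pairs,
# but never pair the last word of a line with the first word of the next.
pairs_within_lines() {
    awk '
        { for (i = 1; i < NF; i++)          # NF is per line with the default RS
              count[$i " " $(i+1)]++
        }
    END { for (key in count)
              print count[key], key
        }' | LC_ALL=C sort -k2              # sort by pair so the order is stable
}

pairs_within_lines <<'EOF'
I am a man
I am not a man I
am a man
EOF
```

For these three lines, "I am" is counted 2 times and "man I" once, compared with 3 and 2 when pairs are allowed to span lines.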

Sample input data:

$ cat words.dat
I am a man
I am not a man I
am a man

One awk idea:

$ awk -v RS='' '                       # paragraph mode: with no blank lines, the file is one long record
    { for (i=1;i<NF;i++)               # loop through list of fields 1 - (NF-1)
          count[$(i)" "$(i+1)]++       # use field i and i+1 as array index
    }
END { for (i in count)                 # loop through array indices
          print count[i],i
    }
' words.dat

This generates:

2 am a
3 a man
1 am not
3 I am
1 not a
2 man I

Note: no sorting requirement was stated; otherwise we could pipe the results to sort or, if using GNU awk, add an appropriate PROCINFO["sorted_in"] setting.
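
If an ordering is wanted, one portable option is the pipe-to-sort route mentioned in the note. The sketch below is only one possible choice: it orders by count (descending) and then by pair. The function name sorted_pair_counts, the sort keys, and LC_ALL=C are assumptions made for illustration.

```shell
# sorted_pair_counts (hypothetical name): same counting idea as the awk
# script above, followed by sort: highest count first, ties ordered by pair.
sorted_pair_counts() {
    awk -v RS='' '
        { for (i = 1; i < NF; i++) count[$i " " $(i+1)]++ }
    END { for (key in count) print count[key], key }
    ' | LC_ALL=C sort -k1,1nr -k2
}

sorted_pair_counts <<'EOF'
I am a man
I am not a man I
am a man
EOF
```

For this sample the two pairs with a count of 3 come first, and the two pairs with a count of 1 come last.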

OP's original input:

$ awk -v RS='' '{for (i=1;i<NF;i++) count[$(i)" "$(i+1)]++} END {for (i in count) print count[i],i}' <<< "I am a man"
1 am a
1 a man
1 I am

Removing the assumption about the word count, i.e., allowing a dynamic number of words per grouping...

$ awk -v wcnt=2 -v RS='' '                  # <word_count> = 2; treat file as one loooong single record
NF>=wcnt { for (i=1;i<=(NF-wcnt+1);i++) {   # loop through starting fields 1 - (NF-<word_count>+1)
               pfx=key=""
               for (j=0;j<wcnt;j++) {       # build count[] index from <word_count> fields
                   key=key pfx $(j+i)
                   pfx=" "
               }
               count[key]++
           }
         } 

END      { for (i in count)                 # loop through array indices
               print count[i],i
         }
' words.dat

-v wcnt=2:

2 am a
3 a man
1 am not
3 I am
1 not a
2 man I

-v wcnt=3:

1 not a man
2 I am a
1 I am not
2 man I am
2 am a man
2 a man I
1 am not a

-v wcnt=5:

1 I am a man I
1 I am not a man
1 am not a man I
1 am a man I am
1 man I am a man
1 man I am not a
1 a man I am not
1 not a man I am
1 a man I am a

-v wcnt=3 with awk '...' <<< "I am a man":

1 I am a
1 am a man

-v wcnt=5 with awk '...' <<< "I am a man":

# no output since there are fewer than wcnt=5 words to work with