How can I iterate through text two lines at a time?

Question

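I have a text file that I want to go through, listing a count for every time two words appear consecutively. For example, the output I want looks like this:

Sample input: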

I am a man

Desired output:

1 I am
1 am a
1 a man

This is what I was thinking:

cat  | sed "s/ /\n/g" | read  word1 && 
    while read word2;
    do
        echo "$word1    $word2";
        word1=word2;
    done

This goes into an infinite loop, though. Any help is appreciated!

Answer 1

Call read twice in the while condition:

while read line1; read line2; do
    echo "$line1 $line2"
done <<EOF
1
a
2
b
EOF

This outputs:

1 a
2 b

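The loop exits as soon as the second read fails, even if the first one succeeded. If you want the loop body to run anyway (with line2 empty), move read line2 into the loop body.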
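
As a rough sketch of that second variant (the function name pair_lines is made up for illustration), the second read moves into the body and an unpaired final line is still printed:

```shell
# Hypothetical helper: pair up lines from stdin, tolerating an odd line count.
pair_lines() {
    while IFS= read -r line1; do
        # If no second line is left, read fails; make line2 explicitly empty.
        IFS= read -r line2 || line2=''
        printf '%s %s\n' "$line1" "$line2"
    done
}

# Three input lines: the last one has no partner.
printf '1\na\n2\n' | pair_lines
```

With this input the loop prints "1 a" and then "2 " (the 2 followed by an empty second field), instead of silently dropping the last line.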

Answer 2

With bash:

set -f         # for slurping in the words of the file, we want word splitting
               # but not glob expansion
words=( $(< "") )

for ((i = 1; i < ${#words[@]}; i++)); do
  printf "%s %s\n" "${words[i-1]}" "${words[i]}"
done

Given @chepner's input file, this outputs:

1 a
a 2
2 b
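
If you would rather not turn globbing off for the rest of the script, one possible alternative (a sketch, assuming bash, since it relies on arrays and read -a) is to fill the array line by line. The sample input here is invented and contains a * to show that nothing gets glob-expanded:

```shell
#!/usr/bin/env bash
# Collect whitespace-separated words without an unquoted expansion,
# so a word such as * is stored literally and set -f is not needed.
words=()
while read -r -a line_words; do
    words+=( "${line_words[@]}" )
done <<'EOF'
1 *
a
EOF

# Same sliding-window loop as above: print each adjacent pair.
for ((i = 1; i < ${#words[@]}; i++)); do
    printf '%s %s\n' "${words[i-1]}" "${words[i]}"
done
```

Note that a final line without a trailing newline would be skipped by this loop; the here-document always ends with one, so it does not matter here.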

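Rewriting your code: you need a grouping construct so that every read reads from the same pipeline. In your attempt, read word1 is the last stage of the pipeline, while the while loop is a separate command after && that reads from the shell's own standard input (normally the terminal), which is most likely why it never finishes.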

tr -s '[:space:]' '\n' < "" | {
  IFS= read -r word1
  while IFS= read -r word2; do 
    echo "$word1 $word2"
    word1=$word2
  done
}
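
To see why the grouping matters, here is a small demonstration (a sketch; the behaviour shown for the plain pipe is what bash does with default settings, other shells such as zsh or ksh may keep the variable):

```shell
# Plain pipe: in bash (default settings) the trailing read runs in a subshell,
# so word1 is still empty once the pipeline has finished.
word1=''
printf 'I\nam\n' | read -r word1
printf 'after plain pipe: [%s]\n' "$word1"

# Grouped: both reads consume the same pipe, one line each.
grouped=$(printf 'I\nam\n' | { read -r first; read -r second; printf '[%s] [%s]' "$first" "$second"; })
printf 'inside group: %s\n' "$grouped"
```

The first printf shows an empty value, while the grouped version sees both lines because the two reads share one standard input.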

For the counting, the simplest approach is to pipe the output into | sort | uniq -c.
Using the words.dat file from @markp-fuso, both solutions output:

      3 I am
      3 a man
      2 am a
      1 am not
      2 man I
      1 not a
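
Put together, the pipeline version with counting could look roughly like this (word_pairs is a name invented for this sketch, and the here-document stands in for the words.dat file):

```shell
# Hypothetical helper: print every adjacent word pair read from stdin.
word_pairs() {
    tr -s '[:space:]' '\n' | {
        IFS= read -r word1
        while IFS= read -r word2; do
            printf '%s %s\n' "$word1" "$word2"
            word1=$word2
        done
    }
}

# sort groups identical pairs together so that uniq -c can count them.
word_pairs <<'EOF' | sort | uniq -c
I am a man
I am not a man I
am a man
EOF
```

The exact ordering and the width of the count column depend on your locale and on your uniq implementation.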

The counting can also be done in bash itself with an associative array (this needs bash 4.0 or later):

declare -A pairs

for ((i = 1; i < ${#words[@]}; i++)); do
  key="${words[i-1]} ${words[i]}"
  pairs[$key]=$(( pairs[$key] + 1 ))
done

for key in "${!pairs[@]}"; do
  printf "%7d %s\n" "${pairs[$key]}" "$key"
done

      1 not a
      3 a man
      1 am not
      2 am a
      3 I am
      2 man I

Answer 3

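Assumptions:

Counts accumulate across the whole file (as opposed to restarting the count on each new line)
Word pairs can span lines, e.g. one\nword is equivalent to one word
We are only interested in 2-word pairs, i.e. there is no need to code for a dynamic number of words (e.g. 3 words, 4 words)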
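
To make the second assumption concrete, here is a small comparison (a sketch; both function names are invented). With awk's default record separator every line is its own record, so a pair never crosses a line break; with RS='' the newline only separates fields, as long as the input has no blank lines:

```shell
# Pairs within each line only (default RS: one record per line).
pairs_per_line() {
    awk '{ for (i = 1; i < NF; i++) print $i, $(i+1) }'
}

# Pairs across line breaks (RS="": newlines separate fields, not records).
pairs_across_lines() {
    awk -v RS='' '{ for (i = 1; i < NF; i++) print $i, $(i+1) }'
}

printf 'one\nword here\n' | pairs_per_line       # prints: word here
printf 'one\nword here\n' | pairs_across_lines   # prints: one word, then word here
```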

Sample input data:

$ cat words.dat
I am a man
I am not a man I
am a man

One awk idea:

$ awk -v RS='' '                       # treat file as one loooong single record
    { for (i=1;i<NF;i++)               # loop through list of fields 1 - (NF-1)
          count[$(i)" "$(i+1)]++       # use field i and i+1 as array index
    }
END { for (i in count)                 # loop through array indices
          print count[i],i
    }
' words.dat

This generates:

2 am a
3 a man
1 am not
3 I am
1 not a
2 man I

Note: no sorting requirement was stated; if there were one, we could pipe the results to sort or, with GNU awk, add an appropriate PROCINFO["sorted_in"] setting.
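
If you do want sorted output, piping to sort might look like this (a sketch; count_pairs is an invented wrapper around the awk script above, and LC_ALL=C is set only to make the ordering predictable):

```shell
# Hypothetical wrapper: count adjacent word pairs in the given files or stdin.
count_pairs() {
    awk -v RS='' '
        { for (i = 1; i < NF; i++) count[$i " " $(i+1)]++ }
        END { for (pair in count) print count[pair], pair }
    ' "$@"
}

# Highest count first; ties are ordered by the pair text (byte order in the C locale).
printf 'I am a man\nI am not a man I\nam a man\n' |
    count_pairs |
    LC_ALL=C sort -k1,1nr -k2
```

For the sample data this should list the two pairs with count 3 first ("I am", then "a man"), followed by the count-2 and count-1 pairs.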

With the OP's original input:

$ awk -v RS='' '{for (i=1;i<NF;i++) count[$(i)" "$(i+1)]++} END {for (i in count) print count[i],i}' <<< "I am a man"
1 am a
1 a man
1 I am

Dropping the assumption that only 2-word pairs are needed, so the number of words becomes dynamic...

$ awk -v wcnt=2 -v RS='' '                  # <word_count> = 2; treat file as one loooong single record
NF>=wcnt { for (i=1;i<=(NF-wcnt+1);i++) {   # loop through list of fields 1 - (NF-<word_count>+1)
               pfx=key=""
               for (j=0;j<wcnt;j++) {       # build count[] index from <word_count> fields
                   key=key pfx $(j+i)
                   pfx=" "
               }
               count[key]++
           }
         } 

END      { for (i in count)                 # loop through array indices
               print count[i],i
         }
' words.dat

With -v wcnt=2:

2 am a
3 a man
1 am not
3 I am
1 not a
2 man I

With -v wcnt=3:

1 not a man
2 I am a
1 I am not
2 man I am
2 am a man
2 a man I
1 am not a

With -v wcnt=5:

1 I am a man I
1 I am not a man
1 am not a man I
1 am a man I am
1 man I am a man
1 man I am not a
1 a man I am not
1 not a man I am
1 a man I am a

With -v wcnt=3 and awk '...' <<< "I am a man":

1 I am a
1 am a man

With -v wcnt=5 and awk '...' <<< "I am a man":

# no output since less than wcnt=5 words to work with
