Find all repeated patterns in a file

I have a file containing a set of a few thousand unique words/terms. It looks like this:

high school teacher
high school student
library
pencil stand
college professor
college graduate

I need a list of all the repeated patterns, so in this case I need the following result:

high
school
high school
college

Is there any way to do this in unix/vim?

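Points of clarification on the requirements:

Q. Must a repetition be on one line, or can it be split across lines?

Q. Are the patterns all sequences of words (one or more words)?

Q. Does spacing within a line matter? Capitalization? Punctuation?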

This works for me (the script goes in a file named script.awk):

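# For every line: count each contiguous run of words starting at each field.
{
    for (i = 1; i <= NF; i++)
    {
        count[$i]++                      # the single word at position i
        sequence = $i
        for (j = i + 1; j <= NF; j++)
        {
            sequence = sequence " " $j   # extend the run by one word
            count[sequence]++
        }
    }
}
# After all input: print every sequence seen more than once.
END {
    for (i in count)
    {
        if (count[i] > 1)
            print i
    }
}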

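The 'every line' code builds each sequence of consecutive words in the line and increments a count for it. The END block loops over the sequences, printing those with a count greater than 1 (that is, the word sequences that were repeated).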
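To make the nested loop concrete, here is a small sketch that prints every sequence the main block would count for a single line. It uses the same two loops as script.awk, with print in place of the count increments. The function name list_sequences is made up for this illustration.

```shell
# Print each contiguous word sequence that the main block of script.awk
# would count for every input line (print replaces count[...]++).
list_sequences() {
    awk '{
        for (i = 1; i <= NF; i++) {
            sequence = $i
            print sequence
            for (j = i + 1; j <= NF; j++) {
                sequence = sequence " " $j
                print sequence
            }
        }
    }'
}

echo "high school teacher" | list_sequences
```

A line of n words yields n(n+1)/2 sequences, so this three-word line produces six: high, high school, high school teacher, school, school teacher, and teacher.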

Given the (extended) data file (called data):

high school teacher
high school student
library
pencil stand
college professor
college graduate
coelacanths are ancient fish
coelacanths are ancient but still alive
coelacanths are ancient and long lived
coelacanths are ancient and can live to be 100 years old
coelacanths are ancient living fossils
coelacanths can live to be ancient
coelacanths are long-lived
coelacanths are slow to mature
coelacanths are denizens of the deep sea
coelacanths can be found off Africa and Indonesia

The output of awk -f script.awk data | sort is:

ancient
ancient and
and
are
are ancient
are ancient and
be
can
can live
can live to
can live to be
coelacanths
coelacanths are
coelacanths are ancient
coelacanths are ancient and
coelacanths can
college
high
high school
live
live to
live to be
school
to
to be

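The data was deliberately constructed to include some longer repeated sequences of up to four words; longer word sequences would be tracked just as effectively.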
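The script as written treats words that differ in case or punctuation as different, while runs of spaces or tabs are already collapsed by awk's default field splitting. If case and punctuation should not matter (the question leaves this open), one possible variant normalizes each line before counting. This is only a sketch under stated assumptions: the text is ASCII, and every run of non-alphanumeric characters is treated as a word separator, so long-lived would be counted as the two words long and lived. The function name repeated_normalized is made up for the example.

```shell
# Variant of script.awk that folds case and treats punctuation as a separator.
# Assumption: ASCII input; any run of characters outside a-z0-9 splits words.
repeated_normalized() {
    awk '{
        $0 = tolower($0)           # fold case; assigning $0 re-splits the fields
        gsub(/[^a-z0-9]+/, " ")    # punctuation and spacing become one separator
        for (i = 1; i <= NF; i++) {
            count[$i]++
            sequence = $i
            for (j = i + 1; j <= NF; j++) {
                sequence = sequence " " $j
                count[sequence]++
            }
        }
    }
    END {
        for (s in count)
            if (count[s] > 1)
                print s
    }' "$@" | LC_ALL=C sort
}

printf '%s\n' 'High school, teacher' 'high   school student' | repeated_normalized
```

With the two sample lines shown, the variant reports high, high school, and school, whereas the original script would report only school, because High and school, (with the comma) do not match their counterparts on the second line.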