How can I detect a sequence of "hollows" (holes, lines not matching a pattern) bigger than n in a text file?

Question

Case scenario:

$ cat Status.txt
1,connected
2,connected
3,connected
4,connected
5,connected
6,connected
7,disconnected
8,disconnected
9,disconnected
10,disconnected
11,disconnected
12,disconnected
13,disconnected
14,connected
15,connected
16,connected
17,disconnected
18,connected
19,connected
20,connected
21,disconnected
22,disconnected
23,disconnected
24,disconnected
25,disconnected
26,disconnected
27,disconnected
28,disconnected
29,disconnected
30,connected

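As can be seen, there are "hollows", understood as lines in the sequence file that carry the "disconnected" value.

In fact, I want to detect these "holes", but it would be useful if I could set a minimum number n of missing entries in the sequence.
I.e.: for n=5, a detectable hole would be the 7...13 part, since there are at least 5 consecutive "disconnected" lines in the sequence. However, the missing 17 should not be considered detectable in this case. Again, at line 21 we get a valid disconnection.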

Something like:

$ detector Status.txt -n 5 --pattern connected
7
21

...which could be interpreted as:

- Missing more than 5 "connected" starting at 7.
- Missing more than 5 "connected" starting at 21.

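I need to script this on a Linux shell, so I was thinking about writing some loops, parsing strings and so on, but I feel like this could be done with Linux shell tools and maybe some simpler programming. Is there a way?

Even though small programs like csvtool are a valid solution, more common Linux commands (grep, cut, awk, sed, wc, etc.) would be more worthwhile for me, since I work with embedded devices.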

Answer 1

#!/usr/bin/env bash
last_connected=0
last_num=0
last_state=connected
min_hole_size=${1:-5}  # default to 5, or take an argument from the command line
# "|| [[ $num ]]" also processes a final line that lacks a trailing newline.
while IFS=, read -r num state || [[ $num ]]; do
  # read empties its variables at EOF, so remember the last record explicitly.
  last_num=$num last_state=$state
  if [[ $state = connected ]]; then
    if (( (num-last_connected) > (min_hole_size+1) )); then
      echo "Found a hole running from $((last_connected + 1)) to $((num - 1))"
    fi
    last_connected=$num
  fi
done

# Special case: Need to also handle a hole that's still open at EOF.
if [[ $last_state != connected ]] && (( last_num - last_connected > min_hole_size )); then
  echo "Found a hole running from $((last_connected + 1)) to $last_num"
fi

...emits, given your file on stdin (./detect-holes <in.txt):

Found a hole running from 7 to 13
Found a hole running from 21 to 29

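See:

BashFAQ #1 - How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?
The conditional expression -- the [[ ]] syntax, used so that string comparisons can be done safely without quoting the expansions.
Arithmetic comparison syntax -- valid inside $(( )) in all POSIX-compliant shells; also available without expansion side effects as (( )), which is a bash extension.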
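
As a small illustration of the constructs referenced above, the sketch below splits one record with IFS=, read and contrasts (( )) with $(( )). The function names split_record and gap_exceeds are made up for this example and do not appear in the script.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, for illustration only.

# Split one comma-separated record into two fields, as the loop does.
split_record() {
  local num state
  IFS=, read -r num state <<<"$1"
  printf 'num=%s state=%s\n' "$num" "$state"
}

# (( )) yields an exit status; here: is the run FIRST..LAST longer than MIN?
gap_exceeds() {   # usage: gap_exceeds FIRST LAST MIN
  (( $2 - $1 + 1 > $3 ))
}

split_record "7,disconnected"
if gap_exceeds 7 13 5; then echo "7..13 is longer than 5"; fi
# $(( )) expands to the computed value instead of returning a status.
echo "length: $(( 13 - 7 + 1 ))"
```

Because IFS is set only for the read command itself, the rest of the shell keeps its normal word splitting.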

Answer 2

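This is a perfect use case for awk, since the machinery of line reading, column splitting, and matching is all built in. The only tricky bit is getting the command-line argument into the script, but it's not too bad: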

#!/usr/bin/env bash
awk -v window="$1" -F, '
BEGIN { if (window=="") {window = 1} }

$2=="disconnected"{if (consecutive==0){start=NR}; consecutive++}
$2!="disconnected"{if (consecutive>window){print start}; consecutive=0}

END {if (consecutive>window){print start}}'

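The window value is given as the first command-line argument; if omitted, it defaults to 1, which means "display the start of gaps with at least two consecutive disconnections". It could probably have a better name. You can give it 0 to include single disconnections. Sample output below. (Note that I added a run of 2 disconnections at the end to test the failure Charles mentioned.)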

njv@organon:~/tmp$ ./tst.sh 0 < status.txt # any number of disconnections
7
17
21
31
njv@organon:~/tmp$ ./tst.sh < status.txt # at least 2 disconnections
7
21
31
njv@organon:~/tmp$ ./tst.sh 8 < status.txt # at least 9 disconnections
21
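
The same awk idea can be wrapped to approximate the interface sketched in the question (detector Status.txt -n 5 --pattern connected). This is only a sketch under assumptions: the function name detector, its option parsing, and the default of 5 are hypothetical and not part of the answer above, and it prints the first field of the starting line rather than NR.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (an assumption, not the answer's code):
#   detector FILE [-n N] [--pattern WORD]
# Prints field 1 of the first line of every run of more than N
# consecutive lines whose second field is not WORD.
detector() {
  local file="" n=5 pattern=connected
  while (( $# )); do
    case $1 in
      -n)        n=${2-};       shift 2 || break ;;
      --pattern) pattern=${2-}; shift 2 || break ;;
      *)         file=$1;       shift ;;
    esac
  done
  awk -F, -v n="$n" -v pat="$pattern" '
    $2 != pat { if (run == 0) start = $1; run++; next }  # inside a hole
              { if (run > n) print start; run = 0 }      # hole just closed
    END       { if (run > n) print start }               # hole open at EOF
  ' "$file"
}
```

With the Status.txt from the question, detector Status.txt -n 5 --pattern connected prints 7 and 21. The pattern is compared as a literal string against the second field; since awk applies escape processing to -v values, a pattern containing backslashes would need extra care.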

Answer 3

Awk solution:

detector.awk script:

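#!/bin/awk -f

BEGIN { FS="," }
$2 == "disconnected"{
    if (f && NR-c==nr) c++;
    else { f=1; c++; nr=NR }
}
$2 == "connected"{
    if (f) {
        if (c > n) {
            printf "- Missing more than %d \"connected\" starting at %d.\n", n, nr
        }
        f=c=0
    }
}
# Also report a hole that is still open at the end of the file.
END { if (f && c > n) printf "- Missing more than %d \"connected\" starting at %d.\n", n, nr }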

Usage:

awk -f detector.awk -v n=5 status.txt

Output:

- Missing more than 5 "connected" starting at 7.
- Missing more than 5 "connected" starting at 21.
