grep through a file conditionally in both directions

Question

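I have a log file that is written to by several instances of a cgi script. I need to extract certain information from it, with the following typical workflow: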

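1. Search for the first occurrence of RequestString
2. Extract the PID from that log line
3. Search backwards for the first occurrence of PID<separator>ConnectionString, to identify the client that initiated the request
4. Do something with the ConnectionString and from 'RequestString'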

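What is the best way to do this? I am thinking of writing a perl script that caches the last N lines and then matches against those lines to perform step 3.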

Is there a better way to do this? Could something like extended regexes handle it?
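The "cache the last N lines" idea can be sketched without perl, using awk as a ring buffer. This is only a rough illustration: the function name pair_with_buffer is made up, and it assumes whitespace-separated fields in the order date, pid, message, as in the sample further down.

```shell
# Hypothetical helper: pair_with_buffer N FILE
# Assumes each log line is "date pid message", separated by whitespace.
pair_with_buffer() {
  awk -v n="$1" '
    $3 ~ /^RequestString/ {
      # Walk backwards through at most n cached lines, newest first.
      for (i = NR - 1; i >= 1 && i >= NR - n; i--) {
        split(buf[i % n], f)
        if (f[2] == $2 && f[3] ~ /^ConnectionString/) {
          print buf[i % n]   # nearest earlier ConnectionString of this pid
          print $0           # the RequestString line itself
          break
        }
      }
    }
    { buf[NR % n] = $0 }     # cache the current line, overwriting the oldest
  ' "$2"
}
```

Calling pair_with_buffer 1000 logfile would print each RequestString line preceded by its ConnectionString line. The obvious weakness is that a ConnectionString more than N lines back is silently missed, so N has to be chosen generously.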

Sample with line numbers for reference (the numbers are not part of the file):

1 date    pid1    ConnectionString1
2 date    pid2    ConnectionString2
3 date    pid3    ConnectionString3
4 date    pid2    SomeOutput2
5 date    pid2    SomeOutput2
6 date    pid4    ConnectionString4
7 date    pid3    SomeOutput3
8 date    pid4    RequestString4
9 date    pid1    SomeOutput1
10 date    pid1    ConnectionString1
11 date    pid1    RequestString1
12 date    pid5    RequestString5

When I grep through this sample file, I want the following to match:

Line 8, paired with line 6
Line 11, paired with line 10 (and not with line 1)

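Specifically, the following should not match:

Line 12, because no ConnectionString is found for that pid (pid5)
Line 1, because that pid logs a new ConnectionString (line 10) before its next RequestString. (Assume that on logging RequestString)
Any line from pid2/pid3, because they never log a RequestString.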
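The pairing rules above can also be sketched as a single forward pass that remembers only the most recent ConnectionString per pid, which avoids searching backwards at all. The function name pair_requests is hypothetical, and the sketch assumes the field layout of the sample (date, pid, message, separated by whitespace).

```shell
# Hypothetical helper: pair_requests FILE
# Assumes each log line is "date pid message", separated by whitespace.
pair_requests() {
  awk '
    # Remember the latest ConnectionString line seen for each pid.
    $3 ~ /^ConnectionString/ { conn[$2] = $0; next }
    # On a RequestString, print the remembered line (if any) and the request.
    $3 ~ /^RequestString/ && ($2 in conn) {
      print conn[$2]
      print $0
    }
  ' "$1"
}
```

On the sample this should pair line 8 with line 6 and line 11 with line 10, and skip line 12 because pid5 never logged a ConnectionString. A later ConnectionString for the same pid simply overwrites the earlier one, which is why line 1 is never reported. The sketch keeps the entry after a match, so a second RequestString from the same pid would pair with the same ConnectionString; whether that is desired is an assumption.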

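I could imagine going forwards, with . matching \n: ((pid\d)\s*(ConnectionString\d))(?!).*\s*RequestString\d and then using the captures to identify the client.

However, ConnectionStrings are far more numerous than RequestStrings (perhaps by a factor of 1,000 to 10,000), so my gut feeling is to pick out the RequestString first and then backtrack.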

I thought I could use (?<) for a lookbehind, but the distance between ConnectionStrings and RequestStrings is essentially arbitrary. Would that work?

Answer 1

Something along these lines:

#!/bin/bash
# Find and number all RequestStrings, then loop through them
grep -n RequestString file | while IFS=":" read -r n string; do
   echo "$n,$string"    # Debug
   # Reverse the first $n lines and print the nearest preceding Connection line
   head -n "$n" file | tail -r | grep -m1 Connection
done

Output

4,RequestString 1
6189:Connection
7,RequestString 2
7230:Connection
9,RequestString 3
8280:Connection

Using this input file

6189:Connection

RequestString 1
7229:Connection
7230:Connection
RequestString 2
8280:Connection
RequestString 3

Note: I used tail -r because OSX lacks tac, which I would prefer.
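As a possible adaptation to the log format in the question, the same backward search can be keyed on the pid and made to use tac where it exists. This is a sketch only: reverse_lines and find_clients are made-up names, and the layout of date, pid, message in whitespace-separated fields is assumed from the question's sample.

```shell
# Hypothetical helper: reverse stdin line by line, using tac when
# available and falling back to BSD "tail -r" otherwise.
reverse_lines() {
  if command -v tac >/dev/null 2>&1; then tac; else tail -r; fi
}

# Hypothetical helper: find_clients FILE
# Assumes each log line is "date pid message", separated by whitespace.
find_clients() {
  file=$1
  grep -n 'RequestString' "$file" | while IFS=: read -r n line; do
    # The pid is the second whitespace-separated field of the request line.
    pid=$(printf '%s\n' "$line" | awk '{ print $2 }')
    # Search backwards from line n for this pid's nearest ConnectionString.
    conn=$(head -n "$n" "$file" | reverse_lines |
      awk -v pid="$pid" '$2 == pid && $3 ~ /^ConnectionString/ { print; exit }')
    if [ -n "$conn" ]; then
      printf '%s\n%s\n' "$conn" "$line"
    fi
  done
}
```

This rereads the head of the file once per RequestString, so it is likely to be slow on large logs with many requests; it is probably most reasonable when RequestStrings are rare, as the question suggests.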

regex

grep