Remove lines from output in bash that contain a huge number of possibilities

Question

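I'm trying to filter the lines of a large txt file (around 10GB) based on the prefix of the called number, but only when the direction column equals 2.

This is the format of the file, which I get from a pipe (from a different script):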

caller_number=34234234324, clear_number=982545345435, direction=1, ...
caller_number=83479234234, clear_number=348347384533, direction=2, ...

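Of course this is just sample data; the real file contains many other columns, but I only want to filter on the clear_number column depending on direction, so this should be enough.

I want to remove the lines whose clear_number does not start with one of a list of prefixes, so here, for example, I would do the following with grep: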

grep -vP 'clear_number=(?!(2207891|22034418|22074450|220201677|220240574|220272183|220722988|220723276|220751152|220774457|220794227|220799141|2202000425|2202000939|2202000967)).*direction=2'

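This works well. The only problem is that the number of prefixes I get is sometimes around 10K-50K, which is a lot of prefixes, and if I try that with grep I get grep: regular expression is too large.

Is there any other way to solve this with Bash commands?

Update

An example. Say I have the following: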

caller_number=34234234324,     clear_number=982545345435, direction=1
caller_number=83479234234,     clear_number=348347384533, direction=2
caller_number=2342334324,      clear_number=5555345435,   direction=1
caller_number=034082394234324, clear_number=33335345435,  direction=1
caller_number=83479234234,     clear_number=348347384533, direction=2
caller_number=83479234234,     clear_number=444447384533, direction=2
caller_number=83479234234,     clear_number=64237384533, direction=2

My list.txt contains:

642
3333
534234235

So it will return only the line

caller_number=83479234234,     clear_number=64237384533, direction=2

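because the clear number starts with 642 and direction=2. In my case it will go through more than 10GB of text and return at least 100K results.

Another update

Sorry, there is one more thing I did not make clear. I get the lines from a piped command, so I need to run | awk... on the output received from the previous command.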

Answer 1

You can use awk to read in the prefixes and filter the lines with

... | awk -F'[,=[:space:]]+' 'FNR==NR {hash[$0]; next} $6 == 2 {for (key in hash) { if (index($4, key) == 1) { print; next } }}' list.txt - > outputfile

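[,=[:space:]]+ is the field-separator regex; it matches one or more commas, equals signs, and whitespace characters.

The FNR==NR {hash[$0]; next} part reads the contents of list.txt, which holds the prefixes one per line.

$6 == 2 requires field 6 (the direction) to equal 2.

Then {for (key in hash) { if (index($4, key) == 1) { print; next } }} looks for a key of hash that is a prefix of the current field 4; if one is found, it prints the line and moves on to the next line.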
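To see the pieces working together, here is a minimal sketch that runs the same awk program on the sample data from the question. The temporary directory, the file name data.txt, and the use of cat as a stand-in for the upstream command are assumptions made only for this demo.

```shell
#!/usr/bin/env bash
# Hypothetical demo setup: recreate the question's sample files in a temp dir.
workdir=$(mktemp -d)
cat > "$workdir/list.txt" <<'EOF'
642
3333
534234235
EOF
cat > "$workdir/data.txt" <<'EOF'
caller_number=34234234324,     clear_number=982545345435, direction=1
caller_number=83479234234,     clear_number=348347384533, direction=2
caller_number=2342334324,      clear_number=5555345435,   direction=1
caller_number=034082394234324, clear_number=33335345435,  direction=1
caller_number=83479234234,     clear_number=348347384533, direction=2
caller_number=83479234234,     clear_number=444447384533, direction=2
caller_number=83479234234,     clear_number=64237384533, direction=2
EOF

# cat stands in for the real upstream command; "-" makes awk read its output.
result=$(cat "$workdir/data.txt" |
  awk -F'[,=[:space:]]+' '
    FNR == NR { hash[$0]; next }      # first file (list.txt): store each prefix
    $6 == 2 {                         # piped lines: only direction=2
      for (key in hash)
        if (index($4, key) == 1) {    # clear_number starts with this prefix
          print
          next
        }
    }' "$workdir/list.txt" -)

printf '%s\n' "$result"
rm -rf "$workdir"
```

Only the line whose clear_number starts with 642 and whose direction is 2 is printed; the 3333 prefix matches a line too, but that line has direction=1.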

Answer 2

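You can refactor the large regular expression into a sed script. Then the only resource limit is really how much memory you have available for sed.

If I guessed correctly what you are attempting, the solution might look something like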

sed -e '/direction=2/!b' \
    -e '/clear_number=2207891/b' \
    -e '/clear_number=22034418/b' \
    -e '/clear_number=22074450/b' \
    -e '/clear_number=220201677/b' \
    -e '/clear_number=220240574/b' \
    -e '/clear_number=220272183/b' \
    -e '/clear_number=220722988/b' \
    -e '/clear_number=220723276/b' \
    -e '/clear_number=220751152/b' \
    -e '/clear_number=220774457/b' \
    -e '/clear_number=220794227/b' \
    -e '/clear_number=220799141/b' \
    -e '/clear_number=2202000425/b' \
    -e '/clear_number=2202000939/b' \
    -e '/clear_number=2202000967/b' \
    -e '/clear_number=/d'

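If one clear_number can be a prefix of another, you could change the regular expressions slightly to make sure the number is followed by a word boundary. Judging from your example, adding a space before the terminating slash looks like it would suffice, though some sed versions also support \> or \b as a word boundary.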
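With 10K-50K prefixes, nobody would type those -e expressions by hand. One possible way to generate the script from list.txt is sketched below. The file names filter.sed and data.txt are assumptions for this demo, and the sketch assumes the prefixes contain only digits (no slashes or regex metacharacters). Like the grep -v attempt in the question, it keeps every line that is not direction=2.

```shell
#!/usr/bin/env bash
# Hypothetical demo setup: sample prefixes and data in a temp dir.
workdir=$(mktemp -d)
printf '%s\n' 642 3333 534234235 > "$workdir/list.txt"
cat > "$workdir/data.txt" <<'EOF'
caller_number=34234234324,     clear_number=982545345435, direction=1
caller_number=83479234234,     clear_number=348347384533, direction=2
caller_number=2342334324,      clear_number=5555345435,   direction=1
caller_number=034082394234324, clear_number=33335345435,  direction=1
caller_number=83479234234,     clear_number=348347384533, direction=2
caller_number=83479234234,     clear_number=444447384533, direction=2
caller_number=83479234234,     clear_number=64237384533, direction=2
EOF

# Build the sed script: the direction check, one "keep and stop" rule per
# prefix, then a final rule deleting whatever direction=2 line is left.
{
  printf '%s\n' '/direction=2/!b'
  sed 's|.*|/clear_number=&/b|' "$workdir/list.txt"
  printf '%s\n' '/clear_number=/d'
} > "$workdir/filter.sed"

result=$(sed -f "$workdir/filter.sed" "$workdir/data.txt")
printf '%s\n' "$result"
rm -rf "$workdir"
```

If only the matching direction=2 lines are wanted, as in the question's update, the first rule can be changed to /direction=2/!d so that all other lines are deleted instead of printed.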

Answer 3

With your shown samples, please try the following. Since the OP changed the samples, the code below has been written to match the new ones.

awk '
FNR==NR{
  arr[$0]
  next
}
match($0,/clear_number=[^,]*/){
  val=substr($0,RSTART+13,RLENGTH-13)
  for(i in arr){
    if(index(val,i)==1 && $NF=="direction=2,"){
      print
      next
    }
  }
}
' list.txt  Input_file

Explanation: a detailed, line-by-line explanation of the code above.

awk '                  ##Starting awk program from here.
FNR==NR{               ##Checking condition if FNR==NR which will be TRUE when list.txt is being read.
  arr[$0]              ##Creating arr array with index of current line.
  next                 ##next will skip all further statements from here.
}
match($0,/clear_number=[^,]*/){  ##Using match to match regex for clear_number till 1st occurrence of comma here.
  val=substr($0,RSTART+13,RLENGTH-13)  ##Creating val which has substring of matched regex.
  for(i in arr){       ##Traversing through arr here.
    if(index(val,i)==1 && $NF=="direction=2,"){ ##Checking condition of index AND last field is direction=2 then do following.
      print            ##Printing current line here.
      next             ##next will skip all further statements from here.
    }
  }
}
' list.txt  Input_file ##Mentioning Input_file names here.

Answer 4

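Closer to your original approach -
(To be clear, this approach is probably not the best for a dataset this large, but people with smaller files may benefit.)

Edit your list.txt so that it holds patterns rather than just prefix strings.
If I use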

clear_number=123.*direction=2
clear_number=03408.*direction=2
clear_number=4567890.*direction=2

and

caller_number=34234234321,     clear_number=982545345435, direction=1
caller_number=83479234232,     clear_number=123347384533, direction=2
caller_number=2342334323,      clear_number=5555345435,   direction=1
caller_number=834792394234324, clear_number=03408345435,  direction=1
caller_number=56779234235,     clear_number=348347384533, direction=2
caller_number=83479234236,     clear_number=456789084533, direction=2
caller_number=83479234237,     clear_number=64237384533,  direction=2

then I get:

$: grep -f list.txt x
caller_number=83479234232,     clear_number=123347384533, direction=2
caller_number=83479234236,     clear_number=456789084533, direction=2

So, inverting the match -

$: grep -vf list.txt x
caller_number=34234234321,     clear_number=982545345435, direction=1
caller_number=2342334323,      clear_number=5555345435,   direction=1
caller_number=834792394234324, clear_number=03408345435,  direction=1
caller_number=56779234235,     clear_number=348347384533, direction=2
caller_number=83479234237,     clear_number=64237384533,  direction=2

Converting list.txt from

642
3333
534234235

to

clear_number=642.*direction=2
clear_number=3333.*direction=2
clear_number=534234235.*direction=2

only requires

$: sed -i.bak 's/^/clear_number=/; s/$/.*direction=2/;' list.txt

This also makes a backup (list.txt.bak).
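If you would rather leave list.txt untouched and read the data from a pipe, the same conversion can write to a separate pattern file. This is only a sketch: the file names patterns.txt and data.txt and the use of cat as the upstream command are assumptions for the demo, which uses the sample data from the question.

```shell
#!/usr/bin/env bash
# Hypothetical demo setup: sample prefixes and data in a temp dir.
workdir=$(mktemp -d)
printf '%s\n' 642 3333 534234235 > "$workdir/list.txt"
cat > "$workdir/data.txt" <<'EOF'
caller_number=34234234324,     clear_number=982545345435, direction=1
caller_number=83479234234,     clear_number=348347384533, direction=2
caller_number=2342334324,      clear_number=5555345435,   direction=1
caller_number=034082394234324, clear_number=33335345435,  direction=1
caller_number=83479234234,     clear_number=348347384533, direction=2
caller_number=83479234234,     clear_number=444447384533, direction=2
caller_number=83479234234,     clear_number=64237384533, direction=2
EOF

# Turn each prefix into a full pattern, writing to a new file so that
# list.txt itself is not modified.
sed 's/^/clear_number=/; s/$/.*direction=2/' "$workdir/list.txt" \
  > "$workdir/patterns.txt"

# cat stands in for the real upstream command; grep reads the pipe.
result=$(cat "$workdir/data.txt" | grep -f "$workdir/patterns.txt")
printf '%s\n' "$result"
rm -rf "$workdir"
```

Note that a blank line in list.txt would turn into the pattern clear_number=.*direction=2, which matches every direction=2 line, so the list should be free of empty lines.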

Answer 5

You can also try this awk:

your_command |
awk '
FNR == NR {
   rexp["=" $1]
   next
}
$3 == "direction=2" {
   for (s in rexp)
      if (index($2, s)) {
         print
         next
      }
}' list.txt -

caller_number=83479234234,     clear_number=64237384533, direction=2

Answer 6

Here is a faster solution, obtained by changing how the inner loop works. It also uses code from the other answers.

FNR==NR{ arr[$0]; next }

$3=="direction=2,"{
    val=substr($2,14)
    for(i=1; i<length(val); i++)
        if(substr(val,1,i) in arr){
            print
            next
        }
}

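The inner loop iterates over the leading digits of clear_number instead of over every key in arr. So rather than looping 10K-50K times, it loops at most as many times as the number has characters, which is about 12 for the given sample.
- On the first pass the loop tests the first character, on the next pass the first two characters, and so on.
- i<length(val) rather than i<=length(val), because the last character will be ,.
- $3=="direction=2," is compared first (if it does not match, this saves all the looping).
- match($0,/clear_number=[^,]*/) is not used, because $2 already has this string.

Save the above code as script.awk and use it as: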
```
... | mawk -f script.awk list.txt -
```
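Note that I also used mawk in the command above. This version of awk has fewer features than GNU awk, but it performs better. I checked the results with version 1.3.4, and they were the same as with GNU awk.
If you don't have mawk, you can use LC_ALL=C awk in place of mawk in the command above. For details, see What does LC_ALL=C do?.

Here is a sample timing result (using mawk):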
```
$ wc data.txt
500000  1500000 36000000 data.txt
$ wc list.txt
12000 12000 73382 list.txt
```
- 0m57.477s --> anubhava's solution, but using index($2,s) instead of ~ s
- 0m59.975s --> RavinderSingh13's solution, but comparing $NF=="direction=2," first
- 1m1.578s --> Wiktor Stribiżew's solution
- 0m0.271s --> this solution
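The prefix-lookup idea can also be tried on the sample data from the question, where the lines end in direction=2 without a trailing comma. The sketch below is a variation, not the script timed above: it borrows the field separator from Answer 1 so that the number is in $4 and the direction in $6, and the loop can therefore run up to the full length of the number. The temp directory, data.txt, and cat as the upstream command are assumptions for the demo.

```shell
#!/usr/bin/env bash
# Hypothetical demo setup: sample prefixes and data in a temp dir.
workdir=$(mktemp -d)
printf '%s\n' 642 3333 534234235 > "$workdir/list.txt"
cat > "$workdir/data.txt" <<'EOF'
caller_number=34234234324,     clear_number=982545345435, direction=1
caller_number=83479234234,     clear_number=348347384533, direction=2
caller_number=2342334324,      clear_number=5555345435,   direction=1
caller_number=034082394234324, clear_number=33335345435,  direction=1
caller_number=83479234234,     clear_number=348347384533, direction=2
caller_number=83479234234,     clear_number=444447384533, direction=2
caller_number=83479234234,     clear_number=64237384533, direction=2
EOF

# cat stands in for the real upstream command; "-" makes awk read its output.
result=$(cat "$workdir/data.txt" |
  awk -F'[,=[:space:]]+' '
    FNR == NR { arr[$0]; next }          # first file: store each prefix as a key
    $6 == 2 {                            # cheap direction test comes first
      # Try every leading substring of the number as an array key: at most
      # length($4) lookups, however many prefixes list.txt holds.
      for (i = 1; i <= length($4); i++)
        if (substr($4, 1, i) in arr) {
          print
          next
        }
    }' "$workdir/list.txt" -)

printf '%s\n' "$result"
rm -rf "$workdir"
```

The cost per line depends on the length of the number rather than on the size of the prefix list, which is the property the timings above illustrate.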
