Bash: store replaced substrings

Question

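I am processing text files through a long pipeline in a bash script, and at one step I need to:

remove the substrings matched by some regular expression
write them to a file
and then continue processing the rest of the text.

I can use anything that works in a pipeline. What is the simplest/fastest way?

Update: an example: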

echo -e " apple pears banana \n kiwi ananas cocoa" | magic_script " [ab][a-z]+" removed.txt | cat

Output:

pears kiwi cocoa

removed.txt:

apple banana ananas

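What should go in place of magic_script " [ab][a-z]+" removed.txt? It should work for any text and any regular expression.

Update 2:

As another example, if the regular expression is /a.{2,3}/:

Output: like the result of sed -E "s/a.{2,3}//g"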

e peba kiwi ocoa

removed.txt: like the result of grep -Eo "a.{2,3}"

appl ars anan anan as c

Answer 1

AWK can be used for this.

See https://www.gnu.org/software/gawk/manual/html_node/Redirection.html, which contains the following conceptual example:

$ awk '{ print $2 > "phone-list"
>        print $1 > "name-list" }' mail-list
$ cat phone-list
-| 555-5553
-| 555-3412
…
$ cat name-list
-| Amelia
-| Anthony
…

Here mail-list contains two columns of information: names in the first column and phone numbers in the second.

Look at the match(string,regex) function (http://www.grymoire.com/Unix/Awk.html#uh-47) for capturing regular expressions, keeping in mind that $0 designates the entire line read in. This function sets RSTART and RLENGTH, which can be used with the substr(string,position,length) function (http://www.grymoire.com/Unix/Awk.html#uh-43) to return the matched pattern (string=$0 if you are searching line by line).
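
As a minimal sketch of how match() and substr() fit together (the function name first_match is made up for this illustration and is not part of AWK or of the linked tutorial), the following prints only the first match found on each input line:

```shell
# Print the first regex match on each input line.
# first_match is a hypothetical helper name used only in this sketch.
first_match() {
    awk -v re="$1" '{
        # match() returns 0 when nothing matches; on success it sets
        # RSTART (start position) and RLENGTH (length of the match).
        if (match($0, re))
            print substr($0, RSTART, RLENGTH)
    }'
}

printf '%s\n' "kiwi ananas cocoa" | first_match '[ab][a-z]+'
```

Because match() reports only the leftmost match, collecting every match on a line needs a loop that restarts the search after each hit.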

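Here is a great introduction to AWK: http://www.grymoire.com/Unix/Awk.html ... it may look long, but it is worth the investment.

Update

If you are actually processing multiple lines that contain fields of information, and you don't particularly care whether the items found are printed in the same columnar form, then the following will work: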

echo -e " apple pears banana \n kiwi ananas cocoa\n pork" | 
awk '{
  #printf "\n"
  for(j=1;j<=NF;j++){
    i=match($j,/[ab][a-z]+/)
    if(i>0){
      print $j > "removed.txt"
    }else{
      printf "%s ", $j   # use a format string so a % in the data is printed literally
    }
  }
}'

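If you do care about preserving the columnar form, you can use the printf that is commented out above and tweak it until it is just right (and replace the second print with printf "%s ", $j). However, because AWK works on fields, the approach above runs into trouble if a single field (i.e. one with no separator inside it) contains several instances of the pattern you want to capture.

Update 2

Here is a better solution that finds every match and does not depend on fields: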

echo -e " apple pears banana \n kiwi ananas cocoa" |
awk '
BEGIN {
  regex="a.{2,3}";
}
{
  ibeg=1;
  imat=match(substr($0,ibeg),regex);
  after=$0;
  while (imat) {
    before = substr($0,ibeg,RSTART-1);
    pattern = substr($0,ibeg+RSTART-1,RLENGTH);
    after = substr($0,ibeg+RSTART+RLENGTH-1);
    printf "%s", before;
    print pattern >"removed.txt";
    ibeg=ibeg+RSTART+RLENGTH-1;
    imat=match(substr($0,ibeg),regex);
  }
  print after;
}
'

Output:

e peba
kiwi ocoa

Removed:

$ cat removed.txt
appl
ars
anan
anan
as c

Answer 2

Here is a solution that leaves the lines intact apart from the parts that are removed:

$ echo -e "apple pears banana \n kiwi ananas cocoa" \
| awk '{ for (i=1;i<=NF;++i) { if ($i ~ /^[ab][a-z]+/) { print $i > "removed.txt"; $i=""}} print }'
 pears 
kiwi  cocoa

$ cat removed.txt 
apple
banana
ananas

Answer 3

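This can be done with sed, but since the regex and the file name are not fixed and sed does not handle shell variables well, awk is the better tool for the job. The awk code we want to run could look like this: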

{
  head = ""
  tail = $0

  while(match(tail, re)) {                     # while there's a match in the
                                               # part of the line we haven't
                                               # yet inspected
    print substr(tail, RSTART, RLENGTH) > file # print the match to the
                                               # file
    head = head substr(tail, 1, RSTART - 1)    # split off the parts before
    tail = substr(tail, RSTART + RLENGTH)      # and after the match
  }
  print head tail                              # print what's left in the end
}

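with suitable parameters re and file. Thanks to @EdMorton for pointing out a problem in the original code and suggesting this revision.

To make it callable the way you proposed in the question, let's put some shell boilerplate around it: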

#!/bin/sh

if [ $# -ne 2 ]; then
    echo "Usage: $0 regex filename"
    exit 1
fi

awk -v re="$1" -v file="$2" '
{
  head = ""
  tail = $0

  while(match(tail, re)) {
    print substr(tail, RSTART, RLENGTH) > file
    head = head substr(tail, 1, RSTART - 1)
    tail = substr(tail, RSTART + RLENGTH)
  }
  print head tail
}'

Put this in a file called magic_script, chmod +x it, and you're good to go. Of course, you can also call awk directly as

awk -v re=' [ab][a-z]+' -v file=removed.txt '{ head = ""; tail = $0; while(match(tail, re)) { print substr(tail, RSTART, RLENGTH) > file; head = head substr(tail, 1, RSTART - 1); tail = substr(tail, RSTART + RLENGTH); } print head tail }'
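
If the step lives inside a larger script, the same loop can also be wrapped in a shell function instead of a separate file. This is only a sketch: the name extract_matches is made up here, and the RLENGTH > 0 guard is an addition that is not in the code above. It is there on the assumption that a regex able to match the empty string would otherwise keep the loop running forever.

```shell
# Hypothetical wrapper around the awk loop shown above.
# $1 = regex, $2 = file that receives the removed substrings.
# Usage: some_command | extract_matches ' [ab][a-z]+' removed.txt | next_command
extract_matches() {
    awk -v re="$1" -v file="$2" '
    {
      head = ""
      tail = $0
      # The RLENGTH > 0 guard is an addition: it stops on an empty match.
      while (match(tail, re) && RLENGTH > 0) {
        print substr(tail, RSTART, RLENGTH) > file   # save the match
        head = head substr(tail, 1, RSTART - 1)      # keep the text before it
        tail = substr(tail, RSTART + RLENGTH)        # continue after it
      }
      print head tail                                # the line minus its matches
    }'
}
```

Note that -v makes awk process backslash escapes in the value, so a regex containing backslashes may need them doubled.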

Answer 4

Using GNU awk for the 4th argument to split():

$ cat tst.awk
{
    split($0,flds,re,seps)
    for (i=1;i in flds;i++) {
        printf "%s", flds[i]
        if (i in seps)
            print seps[i] > "removed.txt"
    }
    print ""
}

$ echo -e " apple pears banana \n kiwi ananas cocoa" | awk -v re=' [ab][a-z]+' -f tst.awk
 pears
 kiwi cocoa

$ cat removed.txt
 apple
 banana
 ananas
