Extracting filenames in bash with regex

Question

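Can anybody help me set up a regular expression?

I have a large LaTeX3 TeXDoc file. LaTeX3 TeXDoc defines the macro \TestFiles{}, which is used to list the names of files that themselves serve as unit tests. You can name more than one file between the braces, so \TestFiles{foo-bar} and \TestFiles{foo-bar, bar+baz,foo_bar_baz} are both syntactically correct uses of this macro.

I want to write a bash script that extracts all unit test files named in \TestFiles{} macros, compiles them with pdflatex, and checks whether pdflatex is able to generate the output file successfully.

I have something like this in my script: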

function get_filenames () {
  ## This regex works but is not sensible enough
  # regex='\TestFiles{(.*)}'
  ## This works also, but is again not precise enough
  regex='\TestFiles{([0-9a-zA-Z+-_, ]*)}'
  ## This should give more than one matching group 
  ## (separated by ", " or ","), but this regex doesn't 
  ## work.  I have no idea why or how to modify, to get 
  ## it working
  
  while read -r line ; do
    if [[ $line =~ $regex ]] ; then
      i=1
        while [ $i -le 3 ]; do
          echo "Match $i: \"${BASH_REMATCH[$i]}\""
          i=$(( i + 1 ))
        done
      echo
    fi
  done < mystyle.dtx
}

Here is an excerpt from the DTX file:

\TestFiles{foo-bar}

\TestFiles{foo-bar, bar+baz,foo_bar_baz}

(You can save this as mystyle.dtx in order to reproduce the next example.)

With the example above, my script gives the following result:

get_filenames
Match 1: "foo-bar"
Match 2: ""
Match 3: ""

Match 1: "foo-bar, bar+baz,foo_bar_baz"
Match 2: ""
Match 3: ""

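I am unable to modify my regex so that the content of the last example, \TestFiles{foo-bar, bar+baz,foo_bar_baz}, is split into three matches.

I tried a regex like regex='\TestFiles{([[:alnum:]+-_]*)[,]+[ ]*}'. I thought [[:alnum:]+-_]* would match a file name. As far as I understand regular expressions, (...) should form a group, which should then be listed in the bash array BASH_REMATCH[$i].

The [,]+ part should reflect that file names must be separated by at least one comma. There may be some white space between the file names, so something like [[:space:]]* or at least [ ]* should represent that. The quantifier * means any number of repetitions, from 0 upwards, while + means one or more occurrences.

But that regex does not work at all: it produces no match.

How must the regex be defined so that each file name is stored as its own matching group? I am looking for the right regex to get this result: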

get_filenames
Match 1: "foo-bar"
Match 2: ""
Match 3: ""

Match 1: "foo-bar"
Match 2: "bar+baz"
Match 3: "foo_bar_baz"

Edit: In my real-world file there might (by now) be more than three test files.

Thanks in advance.

Answer 1

## This should give more than one matching group
regex='\TestFiles{([0-9a-zA-Z+-_, ]*)}'

The element of BASH_REMATCH with index n is the portion of the string matching the nth parenthesized subexpression.

Your regex has only one "parenthesized subexpression", which is why everything ends up in BASH_REMATCH[1].

$ regex='\TestFiles{([0-9a-zA-Z+-_, ]*)}'
$ [[ $line =~ $regex ]]
$ declare -p BASH_REMATCH
declare -a BASH_REMATCH=(
    [0]="\TestFiles{foo-bar, bar+baz,foo_bar_baz}" 
    [1]="foo-bar, bar+baz,foo_bar_baz"
)
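As an aside on the bracket expression in that regex: a hyphen between two characters denotes a range, so `+-_` covers everything from `+` to `_` in ASCII order, including `.`, `/`, and `:`. The sketch below illustrates this under the C locale; `only_uses` is a hypothetical helper written for this illustration, and placing the hyphen last in the brackets is one common way to make it literal.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if string $2 consists only of
# characters from the bracket expression $1.
only_uses() {
    local LC_ALL=C          # pin the locale so ranges follow ASCII order
    local re="^$1*\$"
    [[ $2 =~ $re ]]
}

# '+-_' is a range, so '.' and '/' are accepted as well.
only_uses '[0-9a-zA-Z+-_, ]' 'foo.bar/baz' && echo "range form accepts foo.bar/baz"

# With the hyphen last it is a literal '-', and '.' is rejected.
only_uses '[0-9a-zA-Z+_, -]' 'foo.bar/baz' || echo "hyphen-last form rejects foo.bar/baz"
only_uses '[0-9a-zA-Z+_, -]' 'foo-bar, bar+baz,foo_bar_baz' && echo "hyphen-last form accepts the sample list"
```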

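Because you are trying to match an unknown number of file names, you would have to build your regex "dynamically" so that it contains the required number of groups.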

$ regex='\TestFiles{([^, }]+)([,}] ?)'
$ [[ $line =~ $regex ]]
$ declare -p BASH_REMATCH
declare -a BASH_REMATCH=(
    [0]="\TestFiles{foo-bar, " 
    [1]="foo-bar" 
    [2]=", "
)

Add another group and see whether it still matches:

$ regex+='([^, }]+)([,}] ?)'
$ [[ $line =~ $regex ]]
$ declare -p BASH_REMATCH
declare -a BASH_REMATCH=(
    [0]="\TestFiles{foo-bar, bar+baz," 
    [1]="foo-bar" 
    [2]=", " 
    [3]="bar+baz" 
    [4]=","
)

You could keep looping until the regex no longer matches, or, perhaps more simply, count the number of , characters in the line.

regex='\TestFiles{([^, }]+)([,}] ?)'
line='\TestFiles{foo-bar, bar+baz,foo_bar_baz}'
commas=${line//[!,]}

for ((i=0; i<${#commas}; i++))
do
    regex+='([^, }]+)([,}] ?)'
done

[[ $line =~ $regex ]]

This results in:

$ declare -p BASH_REMATCH
declare -a BASH_REMATCH=(
    [0]="\TestFiles{foo-bar, bar+baz,foo_bar_baz}" 
    [1]="foo-bar" 
    [2]=", " 
    [3]="bar+baz" 
    [4]="," 
    [5]="foo_bar_baz" 
    [6]="}"
)
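The other option mentioned above, looping until the regex no longer matches, can also be sketched without growing the regex at all: match the brace contents once, then peel one name off the front of the remainder on each pass. This is only a minimal sketch; the function name `extract_names` and the two patterns are assumptions made for this illustration, not part of the original script.

```bash
#!/usr/bin/env bash
# Print each file name from a single \TestFiles{...} line, one per line.
extract_names() {
    local line=$1 rest
    local -a names=()
    local head_re='\\TestFiles\{([^}]*)\}'   # group 1: everything between the braces
    local name_re='^[ ,]*([^, ]+)(.*)$'      # group 1: next name, group 2: the remainder

    [[ $line =~ $head_re ]] || return 1
    rest=${BASH_REMATCH[1]}

    # Each pass consumes one name; the loop stops when no name is left.
    while [[ $rest =~ $name_re ]]; do
        names+=("${BASH_REMATCH[1]}")
        rest=${BASH_REMATCH[2]}
    done

    (( ${#names[@]} )) && printf '%s\n' "${names[@]}"
}

extract_names '\TestFiles{foo-bar, bar+baz,foo_bar_baz}'
```

Because the names end up in an ordinary array, the number of files per line does not need to be known in advance.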

Alternative approach using `IFS`

You can set IFS=', ' and let bash do the splitting for you.

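line='\TestFiles{foo-bar, bar+baz,foo_bar_baz}'

# \\ matches the literal backslash at the start of the line
[[ $line = \\TestFiles{* ]] && {
    # Remove leading '\TestFiles{'
    line=${line#*{}
    # Remove trailing '}' (escaped, otherwise it would end the expansion early)
    line=${line%\}}

    # Split on commas and spaces into the array 'filenames'
    IFS=', ' read -r -a filenames <<< "$line"

    declare -p filenames
}

declare -a filenames=([0]="foo-bar" [1]="bar+baz" [2]="foo_bar_baz")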
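To apply the same splitting to every line of a file rather than to one string, something along these lines might work. It is a minimal sketch: `list_test_files` is a name invented here, and it assumes at most one \TestFiles{...} macro per line.

```bash
#!/usr/bin/env bash
# Print every file name listed in \TestFiles{...} lines of the file given as $1.
list_test_files() {
    local file=$1 line name
    local -a filenames

    # '|| [[ -n $line ]]' also handles a last line without a trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        [[ $line == *\\TestFiles{*}* ]] || continue
        line=${line#*\\TestFiles\{}   # drop everything up to and including the opening brace
        line=${line%%\}*}             # drop the closing brace and whatever follows it
        IFS=', ' read -r -a filenames <<< "$line"
        for name in "${filenames[@]}"; do
            printf '%s\n' "$name"
        done
    done < "$file"
}

# Demo on the two sample lines, fed through stdin.
list_test_files /dev/stdin <<'EOF'
\TestFiles{foo-bar}
\TestFiles{foo-bar, bar+baz,foo_bar_baz}
EOF
```

Each printed name could then be handed to a command such as `pdflatex -interaction=nonstopmode -halt-on-error "$name"`, with its exit status deciding whether the test counts as passed.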

Answer 2

I believe this is the regex you are looking for:

(?<=\TestFiles{.*)([\w\d\-\+_]+)[, }]+

You can see it working, modify it, and read an explanation of what it does at the following link: https://regex101.com/r/0W8PBi/1

Answer 3

Edit (no external programs, although it is rather impractical and tied to exactly three matches):

function get_filenames () {
    p='([^, }]*) *,? *'
    regex="\TestFiles\{$p$p$p"

    while read -r line ; do
        if [[ $line =~ $regex ]] ; then
            i=1
            while [ $i -le 3 ]; do
                echo "Match $i: \"${BASH_REMATCH[$i]}\""
                i=$(( i + 1 ))
            done
            echo
        fi
    done < mystyle.dtx
}

If you really need to output exactly three file names (even empty ones) for each "\TestFiles" line, then the code is as follows.

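function get_filenames () {
    local MAX_FILES_CNT=3
    local IFS=$'\n'   # split command output on newlines only; local, so the caller's IFS is untouched
    local line filename filenames i
    # \\ matches the literal backslash; \K drops the macro name from the reported match
    for line in $(grep -oP '\\TestFiles\{\K[^}]*' < mystyle.dtx); do
        filenames=()
        # -o prints every name on its own line
        for filename in $(grep -oP "[^, ]+" <<< "$line"); do
            filenames+=("$filename")
        done
        i=0
        # Always print MAX_FILES_CNT entries; missing names show up as ""
        while [ $i -lt $MAX_FILES_CNT ]; do
            echo "Match $((i+1)): \"${filenames[i]}\""
            i=$(( i + 1 ))
        done
        echo ""
    done
}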

Match 1: "foo-bar"
Match 2: ""
Match 3: ""

Match 1: "foo-bar"
Match 2: "bar+baz"
Match 3: "foo_bar_baz"

By the way, BASH_REMATCH is not well suited to this task, because a repeated group keeps only its last repetition. See:

[[ "asdf" =~ (.)* ]]
echo "${BASH_REMATCH[@]}"

asdf f
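The same effect can be seen on the sample line from the question. In this sketch (the regex is an illustration written for this purpose, not taken from the question), the group for ", name" is repeated with `*`, and only its final repetition survives in BASH_REMATCH, so the middle name is not available in any group.

```bash
#!/usr/bin/env bash
line='\TestFiles{foo-bar, bar+baz,foo_bar_baz}'

# Group 1: the first name.
# Group 2: a repeated ", name" part; group 3: the name inside it.
regex='\{([^, }]+)(, *([^, }]+))*\}'

if [[ $line =~ $regex ]]; then
    printf 'first name:      %s\n' "${BASH_REMATCH[1]}"
    # Only the final repetition is kept; "bar+baz" is overwritten.
    printf 'last repetition: %s\n' "${BASH_REMATCH[3]}"
fi
```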

I also suggest reading this question: https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice

Answer 4

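Use set and IFS to split each line into new positional parameters. Assign $@ to an array so that the elements can be accessed by index. Attempting this directly with $@ results in a bad substitution error.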

get-filenames.sh

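#!/usr/bin/env bash

get_filenames() {
    local IFS=' {},'   # split on spaces, braces, and commas
    local line i
    declare -a names

    while read -r line; do
        set -- $line          # unquoted on purpose: word splitting does the parsing
        names=("$@")
        if [[ ${names[0]} == '\TestFiles' ]]; then
            for i in {1..3}; do
                printf 'Match %i: "%s"\n' "$i" "${names[$i]}"
            done
            echo              # blank line only after lines that matched
        fi
    done < 'mystyle.dtx'
}

get_filenames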

get_filenames

mystyle.dtx

\TestFiles{foo-bar}
\TestFiles{foo-bar, bar+baz,foo_bar_baz}

Output

Match 1: "foo-bar"
Match 2: ""
Match 3: ""

Match 1: "foo-bar"
Match 2: "bar+baz"
Match 3: "foo_bar_baz"
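Since the question's edit mentions that there may be more than three files per macro, the same IFS idea can be sketched without the fixed `{1..3}` loop by letting `read -a` fill the array and iterating over however many names it holds. The function name `print_matches` and the file argument are assumptions made for this illustration; as in the script above, the macro is expected at the start of the line.

```bash
#!/usr/bin/env bash
# Print "Match N" lines for every name in each \TestFiles{...} line of file $1.
print_matches() {
    local IFS=' {},'   # split on spaces, braces, and commas
    local i
    local -a names

    # '|| (( ${#names[@]} ))' also handles a last line without a trailing newline.
    while read -r -a names || (( ${#names[@]} )); do
        [[ ${names[0]} == '\TestFiles' ]] || continue
        # names[0] is the macro name; the file names start at index 1.
        for (( i = 1; i < ${#names[@]}; i++ )); do
            printf 'Match %d: "%s"\n' "$i" "${names[i]}"
        done
        echo
    done < "$1"
}

print_matches /dev/stdin <<'EOF'
\TestFiles{foo-bar}
\TestFiles{foo-bar, bar+baz,foo_bar_baz, one-more}
EOF
```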

Answer 5

I suggest an awk script, which handles one or more files.

get_filenames.awk

/\TestFiles{[^}]*}/ { # handle only lines matching regex filter
  filesCount = split($0, fileNamesArr, "\\TestFiles{[ ]*|[ ]*,[ ]*|[ ]*}"); # parse line to array fileNamesArr
  for (i = 2; i < filesCount; i++) { # read elements 2 --> filesCount - 1
    printf("Match %d in %s: \"%s\"\n", i - 1, FILENAME, fileNamesArr[i]); # format print fileNames
  }
  print"";
}

Test file: input.1.txt

some text line 1
\TestFiles{foo-bar0}
some text \TestFiles{foo-bar1, bar+baz1, foo_bar_baz1}
some text \TestFiles{foo-bar2 ,bar+baz2 ,foo_bar_baz2 }
some text \TestFiles{ foo-bar3 , bar+baz3 , foo_bar_baz3 } some text
line 4

Test file: input.2.txt

    \TestFiles{file10, file11}
text
text \TestFiles{  file20 } some text
text\TestFiles{file30,file31,file32   }text
text

Testing `get_filenames.awk`

awk -f get_filenames.awk input.1.txt input.2.txt

Match 1 in input.1.txt: "foo-bar0"

Match 1 in input.1.txt: "foo-bar1"
Match 2 in input.1.txt: "bar+baz1"
Match 3 in input.1.txt: "foo_bar_baz1"

Match 1 in input.1.txt: "foo-bar2"
Match 2 in input.1.txt: "bar+baz2"
Match 3 in input.1.txt: "foo_bar_baz2"

Match 1 in input.1.txt: "foo-bar3"
Match 2 in input.1.txt: "bar+baz3"
Match 3 in input.1.txt: "foo_bar_baz3"

Match 1 in input.2.txt: "file10"
Match 2 in input.2.txt: "file11"

Match 1 in input.2.txt: "file20"

Match 1 in input.2.txt: "file30"
Match 2 in input.2.txt: "file31"
Match 3 in input.2.txt: "file32"
