How can I parse a line of potentially quoted search terms, and how can I use these terms to match input lines that each contain all of them?

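I'm using Bash on Mac OS X El Capitan.

My main goals:

  1. Allow the user to enter any number of search terms, including search terms in quotes.

  2. Go through a text file and show the user every line that contains a match for all of the search terms entered.

I have a text file that is essentially an index of the files on various servers. For example: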

/Volumes/Server1/Resources/Images/this.jpg
/Volumes/Server2/Inventory/docs/that.doc
/Volumes/Server6/Projects/Project 32/the other.pdf
/Volumes/Server6/Projects/Project 32/audio video training.doc

I'm using

read -r sSearchTerms

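to let the user enter search criteria. Basically, I just want to parse whatever she types so that she can search on multiple terms. I'd also like to allow quotes for searching on terms that contain spaces.

For example, the user might enter "Project 32" "audio video" doc

In that case, I want to compare the 3 terms (Project 32, audio video, doc) against each line of my index text file and build a results.txt file that I can then easily display to the user.

The two main things I need to figure out:

  1. How to properly parse the input line into separate strings for comparison (making sure that anything inside quotes is treated as a single search term, with the quotes removed before comparing). I'm imagining using an array?

    • stringCompare[0]="Project 32"
    • stringCompare[1]="audio video"
    • stringCompare[2]="doc"
  2. How to properly test each line of the text file to see whether it contains a match for every search term entered.

Below is my working code, which compares the whole input line as a single search term.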

#!/bin/bash

SEARCH_FILE="/Users/User/Desktop/SEARCH_TEST_2.txt"
RESULTS_FILE="results.txt"
# -i ignores case
GREP_OPTS="-i"

echo "PLEASE ENTER YOUR SEARCH:"

# -r treats backslash as a backslash, instead of an escape character.
read -r sSearchTerms

echo " Searching..."

grep $GREP_OPTS "$sSearchTerms" "$SEARCH_FILE">>"$RESULTS_FILE"

echo " All Done! "

# -t = open with default text editor
open -t "$RESULTS_FILE"

osascript -e 'tell application "Terminal" to quit' &
exit

I'd like to replace everything after

read -r sSearchTerms

with this:

strQuotes='"'
numberOfQuotes=$(grep -o "$strQuotes" <<< "$sSearchTerms" | wc -l)

if [ "$(($numberOfQuotes%2))" != "0" ]
then
    echo "ODD number of quotes"
    # Can't properly parse an odd number of quotes, so Abort!
else
    echo "EVEN number of quotes"
    # We're good to go on quotes, so go ahead and process

    # Create or overwrite the results file
    echo "">"$RESULTS_FILE"

    # CODE HERE to parse input

    # CODE HERE to compare terms to index and build results file
fi

echo " All Done! "
open -t "$RESULTS_FILE"

osascript -e 'tell application "Terminal" to quit' &
exit

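For safety and ease of coding, I already test to make sure the user entered an even number of quotes (0, 2, 4, ...). If not, I'll show a message asking the user to try again.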
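A minimal sketch of that check-and-retry idea, assuming two hypothetical helper names, has_balanced_quotes and prompt_for_terms (neither appears in the script above). It counts double quotes with a parameter expansion instead of grep and wc, and keeps prompting until the count is even:

```shell
#!/bin/bash

# Succeed (status 0) when the string holds an even number of double quotes.
has_balanced_quotes() {
  local q='"' onlyQuotes
  onlyQuotes=${1//[^$q]/}          # delete every character that is not a "
  (( ${#onlyQuotes} % 2 == 0 ))
}

# Prompt until a line with balanced quotes is entered, then print it.
# Fails (status 1) if the input ends before a valid line arrives.
prompt_for_terms() {
  local line
  while IFS= read -r -p 'PLEASE ENTER YOUR SEARCH: ' line; do
    if has_balanced_quotes "$line"; then
      printf '%s\n' "$line"
      return 0
    fi
    echo 'Unbalanced quotes, please try again.' >&2
  done
  return 1
}
```

A caller could then write sSearchTerms=$(prompt_for_terms) || exit 1 in place of the plain read.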

I don't know if this helps, but here is a bash function that takes arguments delimited by quotes.

#!/bin/bash
work_on_list()
{
  length=$#
  echo "There are $length items"
  for i in {1..1000}
    do
      if [ "$i" -gt "$length" ]
      then
          break
      else
        item=${!i}
        echo "$i  $item"
      fi
    done
}
work_on_list a b "c d" "e f g h" 
work_on_list
work_on_list "This is the first" "second item" "and now the third"

The result is

There are 4 items
1  a
2  b
3  c d
4  e f g h
There are 0 items
There are 3 items
1  This is the first
2  second item
3  and now the third
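For comparison, here is a shorter sketch of the same listing that iterates over "$@" directly (list_items is a hypothetical name). It also illustrates a caveat: quotes are only honored when the shell parses them from the command line or script source. Text stored in a variable, such as a line obtained with read, is not re-parsed for quotes when it is expanded.

```shell
#!/bin/bash

# Print each argument with its 1-based position; "$@" keeps every
# argument intact, including embedded spaces.
list_items() {
  local i=0 item
  echo "There are $# items"
  for item in "$@"; do
    i=$((i + 1))
    echo "$i  $item"
  done
}

# Quotes written in the script are parsed by the shell: 2 items.
list_items "c d" e

# Quotes stored in a variable are plain characters: 3 items ("c, d", e).
line='"c d" e'
list_items $line
```

So a function like this cannot split a user-typed line by itself; the line still has to be tokenized first.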

Take a look at the code below, which reads a file line by line and runs a regular expression against each line.

 #!/bin/bash
 searchline()
 {
 # The following uses the first argument as a regex
 # pattern to test against the second argument.  You
 # can use more complex regex patterns to test for
 # multiple substrings
 if [[ "$2" =~ $1 ]]
    then
       echo "     String found"
    else
       echo "     String not found"
    fi
 }
 # At this point you read and list the arguments
 args=$#
 echo $args arguments
 echo "list of arguments preceded by script name"
 for i in `seq 0  $args`
    do
       echo "     $i   ${!i}"
    done
 echo "end of arguments"
 # Move the arguments into the variables
 if [ "$args" -gt 0 ]
    then
       search="$1"
    else
       search="abc"
       echo "Using default search string of abc"
   fi 
 if [ "$args" -gt 1 ]
    then
       file="$2";
    else
       file="stdin"
    fi
 # Read from the file or standard input, running the
 # function above against each line
 if [ "$file" = "stdin" ]
    then
       echo "Read from stdin"
       # read returns non-zero at end of input, which ends the loop
       while IFS= read -r line
       do
          echo "$line"
          # Quote both arguments so a multi-word line stays one argument
          searchline "$search" "$line"
       done
    else
       echo "Read from $file"
       while IFS= read -r line
          do
             echo "$line"
             searchline "$search" "$line"
          done < "$file"
    fi

I ran the program with the following sets of arguments. The test file, stored in testfiles.txt, is listed after them.

"the" testfiles.txt

"The" testfiles.txt

 This is the first line
 Second line
 Now if the time for all good men to come to the aid of their country
 Ignorance of the law is no excuse
 The quick brown fox jumped over the lazy dog
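If a literal substring test is enough, the read loop can be reduced to the following sketch. The function name print_matching_lines is made up for illustration, and the match is case-sensitive:

```shell
#!/bin/bash

# Print every line from standard input that contains the literal
# string given as the first argument.
print_matching_lines() {
  local needle=$1 line
  # IFS= keeps leading and trailing whitespace, -r keeps backslashes;
  # the [[ -n ]] test also handles a last line without a trailing newline.
  while IFS= read -r line || [[ -n $line ]]; do
    if [[ $line == *"$needle"* ]]; then
      printf '%s\n' "$line"
    fi
  done
}
```

It would be called as print_matching_lines the < testfiles.txt.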

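Splitting a list of potentially quoted terms into an array of unquoted terms:

Assuming you don't need to support embedded " instances inside "..."-quoted search terms, you can use xargs to split your list of search terms into individual, unquoted terms, because xargs recognizes both double-quoted and single-quoted tokens: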

#!/bin/bash

# Prompt the user for a list of potentially quoted search terms.
read -r -p 'PLEASE ENTER YOUR SEARCH: ' termList

# Split the list of terms into an array of unquoted terms.
IFS=$'\n' read -d '' -ra terms < <(xargs printf '%s\n' <<<"$termList")

With the sample input ("Project 32" "audio video" doc), if you run declare -p terms after the above, you get:

declare -a terms='([0]="Project 32" [1]="audio video" [2]="doc")'

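This shows that the list was correctly split into unquoted search terms (the " around the element values are not part of the values themselves; they are merely an artifact of printing the array contents with declare -p).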
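As a sketch, the same split can be wrapped in a function that also reports malformed input. The name split_terms and the global array terms are assumptions for illustration; the approach relies on xargs exiting with a non-zero status when it meets an unmatched quote:

```shell
#!/bin/bash

# Split a string of optionally quoted terms into the global array "terms".
# Returns non-zero if xargs rejects the input (e.g. an unmatched quote).
split_terms() {
  local unquoted
  terms=()
  # Capture the output first so that the exit status of xargs is visible.
  unquoted=$(xargs printf '%s\n' <<<"$1" 2>/dev/null) || return 1
  # xargs printed one term per line; read those lines into the array.
  # read -d '' returns non-zero at end of input, hence the || true.
  IFS=$'\n' read -d '' -ra terms <<<"$unquoted" || true
  return 0
}

if split_terms '"Project 32" "audio video" doc'; then
  printf 'term: %s\n' "${terms[@]}"
fi
```

Capturing the output in a variable, rather than reading from a process substitution, is what makes the failure detectable.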


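Searching for lines that each contain all of multiple search terms:

Passing multiple search terms to grep only supports disjunctive logic: any line that matches any one of the terms is a match.

You therefore have to roll your own conjunctive logic, i.e., match only lines that contain all of the terms.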
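One way to get conjunctive behavior from grep alone is to filter repeatedly, once per term. This is a rough sketch with a hypothetical function name, grep_all_terms; it treats each term as a literal, case-insensitive string:

```shell
#!/bin/bash

# Print the lines of file $1 that contain every remaining argument as a
# literal, case-insensitive substring, by filtering once per term.
grep_all_terms() {
  local file=$1 matches term
  shift
  matches=$(cat -- "$file")
  for term in "$@"; do
    # -F: literal string, -i: ignore case, -e: the term may start with "-".
    # grep fails when nothing matches, so stop early in that case.
    matches=$(grep -F -i -e "$term" <<<"$matches") || break
  done
  if [[ -n $matches ]]; then
    printf '%s\n' "$matches"
  fi
}
```

Each term costs one more grep process and one more pass over the remaining lines.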

While you could call grep in a loop, that is inefficient, so awk is the better choice:

# Search each line of the input file for ALL terms entered and print only
# matching lines.
awk '
  NR==FNR { terms[++i] = $0; next }
  { for(i in terms) { if (index(tolower($0), terms[i]) == 0) next } print }
' <(printf '%s\n' "${terms[@]}" | tr '[:upper:]' '[:lower:]') file

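Notes:

  • The above performs literal substring matching, because I assume you don't want to support users entering regular expressions as search terms.

    • If you do, use the !~ operator instead of the index() call, but note that the regex dialect BSD awk supports has fewer features than the one BSD grep supports.
  • BSD awk, as found on macOS, has the following limitations:

    • It doesn't support case-insensitive matching, hence the need to convert both the terms (tr '[:upper:]' '[:lower:]') and each input line (tolower($0)) to lowercase before matching.

    • Even then, matching only works for characters in the ASCII range, because BSD awk is not Unicode-aware.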
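For the regular-expression variant mentioned in the notes, a sketch could look like this. The function name awk_all_regexes is hypothetical, and lowercasing is done inside awk rather than with tr. Because both the pattern and the line are lowercased, patterns that depend on uppercase (such as [[:upper:]]) will not behave as written:

```shell
#!/bin/bash

# Print the lines of file $1 that match every remaining argument, each
# treated as a case-insensitive extended regular expression.
awk_all_regexes() {
  local file=$1
  shift
  awk '
    # First input (the term list): store one lowercased regex per line.
    NR == FNR { terms[++n] = tolower($0); next }
    # Second input: skip the line as soon as one regex fails to match.
    {
      for (i = 1; i <= n; i++) {
        if (tolower($0) !~ terms[i]) next
      }
      print
    }
  ' <(printf '%s\n' "$@") "$file"
}
```

For example, awk_all_regexes "$searchFile" 'server[26]' '\.doc$' would print only .doc entries on Server2 or Server6.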


Putting it all together:

#!/bin/bash

# Determine filenames.
# Note: Better not to use all-uppercase variable names in Bash, because
#       they can conflict with special environment and shell variables.
searchFile="$HOME/Desktop/SEARCH_TEST_2.txt"
resultsFile='results.txt'

# Prompt the user for a list of potentially quoted search terms.
read -r -p 'PLEASE ENTER YOUR SEARCH: ' termList

# Split the list of terms into an array of unquoted terms.
IFS=$'\n' read -d '' -ra terms < <(xargs printf '%s\n' <<<"$termList")

echo 'Searching...'

# Search each line of the input file for ALL terms entered and print only
# matching lines.
awk '
  NR==FNR { terms[++i] = $0; next }
  { for(i in terms) { if (index(tolower($0), terms[i]) == 0) next } print }
' <(printf '%s\n' "${terms[@]}" | tr '[:upper:]' '[:lower:]') "$searchFile" >"$resultsFile"

echo "All Done!" 

open -t "$resultsFile"

osascript -e 'tell application "Terminal" to quit' &
exit