Bash: find function in files with the same content

Question

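I'm trying to solve a problem that looks like this. Let's say that in a directory I have a number of scripts with some content (it doesn't matter what they do):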

example1.sh
example2.sh
example3.sh
...etc.

There are 50 scripts in total. Some of them contain the same function, for example:

function foo1
{
    echo "Hello"
}

In some scripts the function can have the same name but different or modified content, for example:

function foo1
{
    echo "$PWD"
}

or

function foo1
{
    echo "Hello"
    ls -la
}

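I have to find the identical functions, with the same name and the same content, across these scripts. For example:

foo1 with the same or modified content in example1.sh and example2.sh -> this is what I want
foo1 with other content in example1.sh and example3.sh -> not interested

My question is: what is the best way to solve this problem? What do you think? My idea was to sort the content of all the scripts and grep for the names of the repeated functions. I managed to do that, but it is still not what I want, because I then have to open every file that contains the function and check its content... which is a pain, because for some functions there are 10 scripts...

I was wondering how to extract the content of the duplicated functions, but I don't know how to do it. What do you think? Or do you have other suggestions?

Thanks in advance for your answers!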

Answer 1

what is the best idea to solve this problem?

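Write a shell language tokenizer and implement enough syntax parsing to extract function definitions from a file. The sources of existing shell implementations would be an inspiration. Then build a database of file -> function+body and list all files that have the same function+body.

For simple enough functions, an awk, perl, or python script would be enough to cover most cases. But the best would be a full shell language tokenizer.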
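For the simple layout used in the question, an awk approach could look roughly like the sketch below. The helper name `list_dupes` is hypothetical. It assumes that `function NAME` sits on its own line, that `{` and `}` each stand alone at the start of a line, that no nested `}` appears in column one, and that file names contain no spaces.

```bash
# list_dupes (hypothetical helper): print "NAME: FILE FILE ..." for every
# function whose name and body are identical in two or more files.
# Assumes the layout from the question: "function NAME" on one line,
# then "{" and "}" each alone at the start of a line.
list_dupes() {
    awk '
        # A new function header: remember the name, reset the body.
        /^function[ \t]+[^ \t]+[ \t]*$/ {
            name = $2; body = ""; infunc = 1; next
        }
        # A closing brace in column one ends the current function.
        infunc && /^}/ {
            key = name SUBSEP body
            files[key] = files[key] " " FILENAME
            count[key]++
            infunc = 0; next
        }
        # Collect body lines, skipping the line with the opening brace.
        infunc && !/^[{][ \t]*$/ { body = body $0 "\n" }
        END {
            for (key in count)
                if (count[key] > 1) {
                    split(key, part, SUBSEP)
                    print part[1] ":" files[key]
                }
        }
    ' "$@"
}
```

It would be called as `list_dupes *.sh`. Anything beyond this fixed layout (one-line definitions, nested braces, here-documents) is exactly where a real tokenizer becomes necessary.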

Do not use function name {. Use name() { instead. See bash obsolete and deprecated syntax.
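As a small illustration, here is foo1 from the question written in the recommended form. Both spellings define the same function in bash, but only the `name() { ...; }` form is specified by POSIX.

```bash
# POSIX-style definition: accepted by bash as well as other sh-compatible shells.
foo1() {
    echo "Hello"
}

# Call it like any other command.
foo1
```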

With the following files:

# file1.sh
function foo1
{
    echo "Hello"
}


# file2.sh
function foo1
{
    echo "Hello"
}

# file3.sh
function foo1
{
    echo "$PWD"
}


# file4.sh
function foo1
{
    echo "$PWD"
}

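the following script:

# For every *.sh file, emit NUL-terminated records of the form
#   FILE \x01 NAME \n BODY
# then let awk group the files that share an identical NAME+BODY.
# Relies on GNU sed (-z, \xHH escapes) and an awk that accepts RS='\0'.
printf "%s\n" *.sh |
while IFS= read -r file; do
     sed -zE '
           s/(function[[:space:]]+([[:print:]]+)[[:space:]]*\{|(function[[:space:]]+)?([[:print:]]+)[[:space:]]*\([[:space:]]*\)[[:space:]]*\{)([^}]*)}/\x01\2\4\n\5\x02/g;
           /\x01/!d;
           s/[^\x01\x02]*\x01([^\x01\x02]*)\x02[^\x01\x02]*/\1\n\x00/g
        ' "$file" |
     sed -z 's~^~'"$file"'\x01~';
done |
awk -v RS='\0' -v FS='\x01' '
        {cnt[$2]++; a[$2]=a[$2]" "$1}
        END{ for (i in cnt) if (cnt[i] > 1) print a[i], i }
'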

outputs:

 file1.sh file2.sh foo1

    echo "Hello"


 file3.sh file4.sh foo1

    echo "$PWD"

This shows that file1.sh and file2.sh contain the same function foo1, and that file3.sh and file4.sh contain the same function foo1.

Also note that a script can do:

if condition; then
   func() { echo something; }
else
   func() { echo something else; }
fi

A real tokenizer would have to take that into account too.
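To make the point concrete: which body func ends up with is only known at run time, so a purely textual scan sees two candidate definitions. In the minimal sketch below, `condition` is a hypothetical stand-in for any test. `declare -f` is the bash builtin that prints the definition currently in effect.

```bash
# "condition" is a hypothetical placeholder; here it simply succeeds.
condition() { true; }

if condition; then
    func() { echo something; }
else
    func() { echo something else; }
fi

# Print the definition bash actually holds after the branch was taken.
declare -f func
```

Asking the shell with `declare -f` only works after the defining code has been executed, which is rarely safe to do with arbitrary scripts, so it does not replace static analysis.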

Answer 2

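Create a message digest of each function's contents and use it as the key in an associative array. Append the files that contain the same function digest to find duplicates.

You may want to normalize whitespace in the function contents and adjust the regular expression address range.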
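One possible way to do the whitespace normalization is sketched here. The helper name `normalize_ws` is hypothetical, and `cksum` stands in for the digest command only to keep the example dependency-free. Note that this simple version also squeezes blanks inside quoted strings, which may or may not be acceptable.

```bash
# normalize_ws (hypothetical helper): read a function body on stdin,
# squeeze runs of blanks to one space, strip leading/trailing blanks,
# and drop empty lines.
# Caveat: blanks inside quoted strings are squeezed as well.
normalize_ws() {
    sed -e 's/[[:blank:]]\{1,\}/ /g' \
        -e 's/^ //' \
        -e 's/ $//' \
        -e '/^$/d'
}

# Two differently indented copies of the same body yield the same checksum.
printf '    echo "Hello"\n'     | normalize_ws | cksum
printf '\techo   "Hello"  \n\n' | normalize_ws | cksum
```

In a digest pipeline, the filter would sit between the command that extracts the function text and the command that hashes it.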

#!/usr/bin/env bash

# the 1st argument is the function name
func_name="$1"
func_pattern="^function $func_name[[:blank:]]*$"
shift
declare -A dupe_groups

while read -r func_dgst file; do # collect results in an associative array
    dupe_groups[$func_dgst]+="$file "
done < <( # the remaining arguments are scripts
    for f in "${@}"; do
        if grep --quiet "$func_pattern" "$f"; then
            dgst=$( # use an address range in sed to print function contents
                sed -n "/$func_pattern/,/^}/p" "$f" | \
                # pipe to openssl to create a message digest
                openssl dgst -sha1 )
            echo "$dgst $f"
        fi
    done )

# print the results
for key in "${!dupe_groups[@]}"; do
    echo "$key ${dupe_groups[$key]}"
done

I tested with your example{1..3}.sh files and added the following example4.sh to provide a duplicate function.

example4.sh

function foo1
{
    echo "Hello"
    ls -la
}

function another
{
    echo "there"
}

To run:

./group-func.sh foo1 example1.sh example2.sh example3.sh example4.sh

Result:

155853f813e944a7fcc5ae73ee2d959e300d217a example1.sh 
7848af9b8b9d48c5cb643f34b3e5ca26cb5bfbdd example2.sh 
4771de27523a765bb0dbf070691ea1cbae841375 example3.sh example4.sh
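
Every digest is printed, including those that belong to a single file. If only real duplicates are of interest, a small filter can be appended. This is a sketch: the helper name `only_dupes` is hypothetical, and it assumes file names without whitespace, which the space-separated lists in the script already rely on.

```bash
# only_dupes (hypothetical helper): keep lines that have a digest
# followed by at least two file names, i.e. three or more fields.
only_dupes() {
    awk 'NF > 2'
}
```

Used as `./group-func.sh foo1 example1.sh example2.sh example3.sh example4.sh | only_dupes`, it would keep only the line listing example3.sh and example4.sh from the result above.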
