How Can I Speed Up This Very Slow Shell Script for Tabulation of Data w/ Uncertainty?
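I work with large data sets that typically consist of an average value and an uncertainty. In publications, usually only one digit of the uncertainty is shown, and the corresponding value is rounded to that decimal place. The uncertainty is then enclosed in parentheses and appended to the shortened average string.
For example:
Avg: 101.0513213 SD: 0.33129
...would give:
101.1(3)
In practice this sounds simple, but it actually gets somewhat complicated, because you must first round the standard deviation to one digit and then use that to determine which decimal place to round the average to. Add the case where the rounding carries over to 10 (i.e. 0.094 rounds to 0.09, but 0.095 rounds to 0.1, which changes the digit to round to) and the fact that you are rounding rather than truncating, and it becomes a bit messy to implement.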
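As a rough illustration of that first step (this is not part of the scripts below), bash's builtin printf can round a number to one significant digit with the %.0e format, and the carry into the next decade then shows up in the exponent. The helper name one_digit and the variables DIGIT and EXPONENT are made up for this sketch, which assumes a positive input. Exact ties such as 0.095 depend on the binary floating-point representation, so the example uses 0.096 instead.

```shell
#!/bin/bash
# Hypothetical helper: split a positive number into its single rounded
# significant digit (DIGIT) and its decimal exponent (EXPONENT),
# using only bash builtins.
one_digit() {
    local LC_ALL=C sci expo
    printf -v sci '%.0e' "$1"        # e.g. 0.33129 -> 3e-01
    DIGIT=${sci%%e*}                 # text before the "e": 3
    expo=${sci##*e}                  # text after the "e": -01
    EXPONENT=$(( 10#${expo:1} ))     # drop the sign, force base 10: 1
    if [[ $expo == -* ]]; then
        EXPONENT=$(( -EXPONENT ))
    fi
}

one_digit 0.33129; echo "$DIGIT x 10^$EXPONENT"   # 3 x 10^-1
one_digit 0.094;   echo "$DIGIT x 10^$EXPONENT"   # 9 x 10^-2
one_digit 0.096;   echo "$DIGIT x 10^$EXPONENT"   # 1 x 10^-1 (carried over)
```

The exponent tells you directly which decimal place the average has to be rounded to.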
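I have a set of bash script functions that accomplish this using a combination of printf, bc, sed, and echo calls. It works, but the computation is very slow. Here is an example you can try yourself. You should be able to see how slow it is: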
#!/bin/bash
function CleanBC() {
    # Convert E-notation (e.g. 2.41321E+05) to bc syntax, then restore the
    # leading zero that bc drops (.5 -> 0.5, -.5 -> -0.5).
    echo "${1/[eE][+][0]/*10^}" | bc -l | \
        sed -e 's#^\(-*\)\.#\10\.#g'
}
function Log10Float() {
    echo $( CleanBC "l($1)/l(10)" )
}
function TruncateDecimal() {
    echo $1 | sed -e 's#\.[0-9]*##g'
}
function PowerOf10() {
    absPow=$( echo $1 | sed 's#-##g' )
    if [[ $( CleanBC "$1==0" ) -eq '1' ]]; then
        echo "1"
    elif [[ $( CleanBC "$1>0" ) -eq '1' ]]; then
        echo "1"$(printf '0%.0s' $( seq 1 $absPow ) )
    elif [[ $( CleanBC "$1==-1" ) -eq '1' ]]; then
        echo "0.1"
    elif [[ $( CleanBC "$1<-1" ) -eq '1' ]]; then
        echo "0."$(printf '0%.0s' $( seq 2 $absPow ) )"1"
    fi
}
function RoundArbitraryDigit() {
    # $1 = value to round, $2 = power of ten to round to
    pow=$( PowerOf10 $2 )
    offset=$( CleanBC "if ($1>=0) {0.5} else {-0.5}" )
    absPow=$( echo $2 | sed -e 's#-##g' )
    invPow=$( PowerOf10 $( CleanBC "$2*-1" ) )
    shiftedVal=$( TruncateDecimal $( CleanBC "$invPow*$1+$offset" ) )
    val=$( CleanBC "scale=15;$shiftedVal*$pow" )
    echo $( printf "%.${absPow}f" $val )
}
function Flt2Int() {
    RoundArbitraryDigit $1 0
}
function Round() {
    # $1 = CLOSEST|UP|DOWN, $2 = step size, remaining args = values
    for v in ${@:3}; do
        div=$( CleanBC "$v / $2" )
        case $( echo $1 | tr '[:lower:]' '[:upper:]' ) in
            CLOSEST)
                val=$( TruncateDecimal $( Flt2Int $div ) );
                ;;
            UP)
                val=$( TruncateDecimal $div );
                ((val++))
                ;;
            DOWN)
                val=$( TruncateDecimal $div );
                ;;
        esac
        echo $( CleanBC "$val * $2" )
    done
}
function Tabulate() {
    # $1 = average, $2 = standard deviation
    roundTo=$( Log10Float $2 )
    roundTo=$( CleanBC "if ($roundTo < 0) {$roundTo -1} else {$roundTo}" )
    roundTo=$( TruncateDecimal $roundTo )
    roundedSD=$( RoundArbitraryDigit $2 $roundTo )
    invPow=$( PowerOf10 $( CleanBC "$roundTo*-1" ) )
    if [[ $( CleanBC "($invPow * $roundedSD) == 10" ) -eq '1' ]]; then
        ((roundTo++))
        roundedSD=$( RoundArbitraryDigit $roundedSD $roundTo )
        invPow=$( PowerOf10 $( CleanBC "$roundTo*-1" ) )
    fi
    intSD=$( CleanBC "($invPow * $roundedSD)" | sed -e 's#\.[0-9]*##g' )
    pow=$( PowerOf10 $roundTo )
    intSD=$( CleanBC "if ($pow > 10 ) {$pow*$intSD} else {$intSD}" )
    val="$( RoundArbitraryDigit $1 $roundTo )"
    if [[ $( CleanBC "$roundTo > -1" ) -eq '1' ]]; then
        val=$( echo $val | sed -e 's#\.0*$##g' )
    fi
    echo "$val(${intSD})"
}
Tabulate '-.9782000' '0.0051335'
Tabulate '105.843516' '8.7571141'
Tabulate '0.2581699' '0.0020283'
Tabulate '3.4368211' '0.0739912'
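My first question is whether a specific function or block of code slows down the overall computation more than the others.
Second, I would like suggestions on how to improve the speed of the overall code in light of the answer to the first question.
Third, as a more general question, I would like to know which tools can be used in a scenario like this to profile bash scripts and identify bottlenecks.
(Note: the CleanBC function is used because other related functions in my use case sometimes produce numbers in scientific notation, e.g. 2.41321E+05. The function is therefore needed to keep bc from failing, which is an additional requirement of my use case.)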
Thanks to @Gordon Davisson's suggestions, I have improved the script. To time it better and to test one more edge case, I changed the closing lines of the old script to:
function test() {
    Tabulate '-.9782000' '0.0051335'
    Tabulate '105.843516' '8.7571141'
    Tabulate '1055.843516' '85.7571141'
    Tabulate '0.2581699' '0.0020283'
    Tabulate '3.4368211' '0.0739912'
}
time test
With the old script I get:
real 0m12.627s
user 0m3.150s
sys 0m9.282s
The new script is:
#!/bin/bash
function CleanBC() {
    # Convert E-notation (e.g. 2.41321E+05) to bc syntax; the pattern must not
    # start with "*", or the mantissa would be replaced as well.
    val=$(bc -l <<< "${1/[eE][+][0]/*10^}")
    # Restore the leading zero that bc drops.
    if [[ $val == -* ]]; then
        echo "${val/#-./-0.}"
    else
        echo "${val/#./0.}"
    fi
}
function Log10Float() {
    CleanBC "l($1)/l(10)"
}
function TruncateDecimal() {
    echo ${1/.*/}
}
function PowerOf10() {
    case $1 in
        10) echo "10000000000" ;;
        9) echo "1000000000" ;;
        8) echo "100000000" ;;
        7) echo "10000000" ;;
        6) echo "1000000" ;;
        5) echo "100000" ;;
        4) echo "10000" ;;
        3) echo "1000" ;;
        2) echo "100" ;;
        1) echo "10" ;;
        0) echo "1" ;;
        -1) echo "0.1" ;;
        -2) echo "0.01" ;;
        -3) echo "0.001" ;;
        -4) echo "0.0001" ;;
        -5) echo "0.00001" ;;
        -6) echo "0.000001" ;;
        -7) echo "0.0000001" ;;
        -8) echo "0.00000001" ;;
        -9) echo "0.000000001" ;;
        -10) echo "0.0000000001" ;;
    esac
}
function RoundArbitraryDigit() {
    # $1 = value to round, $2 = power of ten to round to
    pow=$( PowerOf10 $2 )
    absPow=$2
    absPow=${absPow/#-/}
    if [[ $1 == -* ]]; then
        offset=-0.5
    else
        offset=0.5
    fi
    if [[ $2 == -* ]]; then
        invPow=$( PowerOf10 $absPow )
    elif [[ $2 == 0 ]]; then
        invPow="1"
    else
        invPow=$( PowerOf10 "-$2" )
    fi
    shiftedVal=$( CleanBC "$invPow*$1+$offset" )
    shiftedVal=${shiftedVal/.*/}
    val=$( CleanBC "scale=15;$shiftedVal*$pow" )
    #printf "%.${absPow}f" $val
    echo $val
}
function Flt2Int() {
    RoundArbitraryDigit $1 0
}
function Round() {
    # $1 = CLOSEST|UP|DOWN, $2 = step size, remaining args = values
    for v in ${@:3}; do
        div=$( CleanBC "$v / $2" )
        case "${1^^}" in
            CLOSEST)
                val=$( Flt2Int $div );
                ;;
            UP)
                #truncate the decimal
                val=${div/.*/}
                ((val++))
                ;;
            DOWN)
                #truncate the decimal
                val=${div/.*/}
                ;;
        esac
        CleanBC "$val * $2"
    done
}
function Tabulate() {
    # $1 = average, $2 = standard deviation
    roundTo=$( Log10Float $2 )
    if [[ $roundTo == -* ]]; then
        roundTo=$( CleanBC "$roundTo -1" )
    fi
    roundTo=${roundTo/.*/}
    roundedSD=$( RoundArbitraryDigit $2 $roundTo )
    if [[ $roundTo == -* ]]; then
        invPow=$( PowerOf10 ${roundTo/#-/} )
    elif [[ $roundTo == 0 ]]; then
        invPow="1"
    else
        invPow=$( PowerOf10 "-${roundTo}" )
    fi
    if [[ $( CleanBC "($invPow * $roundedSD)" ) == "10" ]]; then
        ((roundTo++))
        roundedSD=$( RoundArbitraryDigit $roundedSD $roundTo )
        if [[ $roundTo == -* ]]; then
            invPow=$( PowerOf10 ${roundTo/#-/} )
        elif [[ $roundTo == 0 ]]; then
            invPow="1"
        else
            invPow=$( PowerOf10 "-${roundTo}" )
        fi
    fi
    intSD=$( CleanBC "($invPow * $roundedSD)" | sed -e 's#\.[0-9]*##g' )
    pow=$( PowerOf10 $roundTo )
    if [[ $pow != 0.* ]] && [[ $pow != "1" ]]; then
        intSD=$( CleanBC "$pow*$intSD" )
    fi
    val="$( RoundArbitraryDigit $1 $roundTo )"
    if [[ $roundTo != -* ]]; then
        echo "${val/.*/}(${intSD})"
    else
        echo "${val}(${intSD})"
    fi
}
function test() {
    Tabulate '-.9782000' '0.0051335'
    Tabulate '105.843516' '8.7571141'
    Tabulate '1055.843516' '85.7571141'
    Tabulate '0.2581699' '0.0020283'
    Tabulate '3.4368211' '0.0739912'
}
time test
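The main difference here is that I removed some superfluous shell calls and replaced others with string manipulation (including removing the bc-based conditional logic). The new timing is: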
real 0m2.566s
user 0m0.605s
sys 0m1.619s
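That is roughly a five-fold speedup!
While I am still considering porting this to a Python script (as he suggested), for now I am very happy with the result, which cut the run time of my tabulation script from about 5 hours to about one hour.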
The best solution would be to write the script in a language that actually supports native floating point math, i.e. pretty much anything except bash. I'll second cdarke's recommendation of Python, but realistically almost anything would be better than a shell script.
The biggest reason for this is that the shell doesn't actually have much capability itself; what it's really good at is launching other programs (bc, sed, etc.) to do the actual work. But launching a program is really computationally expensive. In the shell, anything involving an external command, a pipe, or $( ... ) (or its backtick equivalent) will need to create a new process, and that's a huge amount of overhead in the middle of what should be some simple computation. Compare these two snippets:
for ((num=1; num<10000; num++)); do
    result="$(echo "$num" | sed 's#0##g')"
done

for ((num=1; num<10000; num++)); do
    result="${num//0/}"
done
They both do the same thing (loop through the numbers from 1 to 9,999, setting result to the number with all "0"s removed). On my computer, the first took 36 seconds, while the second took 0.2 seconds. The reason is simple: in the second, everything is done directly in bash, with no need to create additional processes. The first, on the other hand, has to create a subshell (i.e. another process running bash) to run the contents of $( ... ), then another subshell to do the echo, then a process running sed to do the substitution. That's three process creations (and exit/cleanups) the computer has to execute every time through the loop. And that's why the first is over 100 times slower than the second.
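A quick way to see the extra process for yourself (a small sketch, not something from the scripts above) is to compare bash's BASHPID variable inside and outside a command substitution. The function name whoami_pid is made up for the example, and BASHPID requires bash 4 or later.

```shell
#!/bin/bash
# Print the process ID of whichever bash process executes this function.
whoami_pid() {
    echo "$BASHPID"
}

main_pid=$BASHPID          # the shell running this script
sub_pid=$(whoami_pid)      # the function body runs in a forked subshell

echo "main shell:           $main_pid"
echo "command substitution: $sub_pid"
if [[ $main_pid != "$sub_pid" ]]; then
    echo "the \$( ... ) body ran in a separate process"
fi
```

The two IDs differ on every run, which is the fork that each $( ... ) in a loop pays for.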
Consider a third snippet:
TrimZeroes() {
    echo "${1//0/}"
}
for ((num=1; num<10000; num++)); do
    result="$(TrimZeroes "$num")"
done
Looks like a cleaner (better abstracted) version of the second, right? It took 8 seconds, because the $( ... ) required creating a subshell to run TrimZeroes in.
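One common way to keep the abstraction without paying for the subshell is to have the function store its result in a variable instead of printing it. The following is only a sketch of that idea; using REPLY as the "return" variable is a convention chosen for this example, not something the snippets above prescribe.

```shell
#!/bin/bash
# Same helper as in the third snippet, but it hands its result back through
# a variable instead of stdout, so the caller needs no $( ... ) and bash
# does not fork.
TrimZeroes() {
    REPLY=${1//0/}
}

for ((num=1; num<10000; num++)); do
    TrimZeroes "$num"
    result=$REPLY
done

echo "$result"    # last value processed: 9999
```

The trade-off is that the function now communicates through shared state, so the caller has to copy REPLY before calling anything else that overwrites it.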
Now, look at one line in your script:
roundTo=$( Log10Float $2 )
This creates a subshell to run Log10Float in. That consists of the single line
echo $( CleanBC "l($1)/l(10)" )
...which creates another subshell to run CleanBC in, which does:
echo "${1/[eE][+][0]/*10^}" | bc -l | sed -e 's#^\(-*\)\.#\10\.#g'
...which creates three more processes, one for each part of the pipeline. That's a total of five processes to take one logarithm!
So, there are a number of things you could do to speed up the script: mostly switching to bash's builtin string manipulation capabilities, and inlining as many of the function calls as possible. But this'll make the script even messier than it already is, and it'll still be far slower than if it were written in a more appropriate language.
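To give a feel for how far the builtins alone can go, here is a minimal sketch of the whole tabulation done without bc, sed, or any $( ... ). It is not the asker's method: it leans on the builtin printf for the floating-point rounding, and the names tabulate and TAB are made up for the example. It assumes a positive uncertainty, inputs that printf can parse as numbers, and a mean small enough for bash's 64-bit integer arithmetic. For uncertainties of 10 or more it prints the uncertainty at full scale, e.g. 1060(90), which is a convention chosen here. Inputs that sit exactly on a rounding tie may come out either way because of binary floating point.

```shell
#!/bin/bash
# Sketch: format "mean(uncertainty)" with one uncertainty digit using only
# bash builtins. The result is stored in the (hypothetical) variable TAB.
tabulate() {
    local LC_ALL=C mean=$1 sd=$2 sci expo digit mag scaled
    printf -v sci '%.0e' "$sd"      # 0.33129 -> 3e-01, 85.7571141 -> 9e+01
    digit=${sci%%e*}                # the single rounded digit
    expo=${sci##*e}                 # signed exponent text, e.g. -01
    mag=$(( 10#${expo:1} ))         # exponent magnitude, forced to base 10
    if [[ $expo == -* ]]; then
        # Uncertainty below 1: keep $mag decimal places of the mean.
        printf -v TAB "%.${mag}f(%s)" "$mean" "$digit"
    else
        # Uncertainty of 1 or more: round the mean to the nearest 10^mag.
        printf -v mean '%.6f' "$mean"               # normalise E-notation
        printf -v scaled '%.0f' "${mean}e-${mag}"   # mean / 10^mag, rounded
        TAB="$(( scaled * 10**mag ))($(( digit * 10**mag )))"
    fi
}

tabulate 101.0513213 0.33129;     echo "$TAB"   # 101.1(3)
tabulate -.9782000 0.0051335;     echo "$TAB"   # -0.978(5)
tabulate 105.843516 8.7571141;    echo "$TAB"   # 106(9)
tabulate 1055.843516 85.7571141;  echo "$TAB"   # 1060(90)
```

Because nothing here forks, each call costs about as much as the parameter-expansion loop shown earlier rather than a bc pipeline.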
Python is nice. Ruby is nice. Even perl would be much better than shell for this.