Bash shell task within for loop

Question

I have two files:

temp_bandstructure.dat has the following format:

# spin    band          kx          ky          kz          E(MF)          E(QP)        Delta E kn  E(MF)5dp
#                        (Cartesian coordinates)             (eV)           (eV)           (eV)     (eV)
     1      22     0.00000     0.00000     0.00000   -3.021665798   -4.022414204   -1.000748406 1   -3.02167
     1      22     0.00850     0.00000     0.00000   -3.026245712   -4.027334803   -1.001089091 2   -3.02625
     1      22     0.01699     0.00000     0.00000   -3.039924052   -4.061680485   -1.021756433 3   -3.03992
9000 more rows

mf_pband.dat has 46 header lines, followed by rows like this:

  1     0.00000   -55.55593   0.998   0.000 ...20 more columns
9000 more rows

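I have a nested for loop that compares columns 1 and 3 of every row in mf_pband.dat against columns 9 and 10 of every row in temp_bandstructure.dat. If the values match to within 0.00001, the script prints the entire row from mf_pband.dat to a cache file.

I wrote a working for loop that does the job, but it is very slow: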

kmax=207
bandmin=$(cat bandstructure.dat | awk 'NR==3''{ print $2 }')
bandmax=$(tac bandstructure.dat | awk 'NR==1''{ print $2 }')
nband=$(($bandmax-$bandmin+1))
nheader=46


for ((i=3;i<=$(($kmax*$nband+2)); i++)); do
    kn=$(awk -v i=$i 'NR==i''{ print $9 }'  temp_bandstructure.dat)
    emf=$(awk -v i=$i 'NR==i''{ print $10 }'  temp_bandstructure.dat)
    
    for ((j=$(($nheader+1));j<=$(($kmax*$nband+$nheader)); j++)); do
        kn_mf_pband=$(awk -v j=$j 'NR==j''{ print $1 }'  mf_pband.dat)
        emf_mf_pband=$(awk -v j=$j 'NR==j''{ print $3 }'  mf_pband.dat)
        if [ "$kn" = "$kn_mf_pband" ] && (( $(echo "$emf - $emf_mf_pband <= 0.00001" |bc -l) )) && (( $(echo "$emf_mf_pband - $emf <= 0.00001" |bc -l) ))
        then
            awk -v j=$j 'NR==j' mf_pband.dat >> temp_copying_cache.dat
            echo $i $j $kn $kn_mf_pband $emf $emf_mf_pband
            break
        fi
    done
done

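Now I am trying to send one of the for loops to background tasks so that I can process many of them in parallel. The modified code reports no errors, but it makes no progress either: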

task(){
    kn_mf_pband=$(awk -v j=$j 'NR==j''{ print $1 }'  mf_pband.dat)
    emf_mf_pband=$(awk -v j=$j 'NR==j''{ print $3 }'  mf_pband.dat)
    if [ "$kn" = "$kn_mf_pband" ] && (( $(echo "$emf - $emf_mf_pband <= 0.00001" |bc -l) )) && (( $(echo "$emf_mf_pband - $emf <= 0.00001" |bc -l) ))
    then
        awk -v j=$j 'NR==j' mf_pband.dat >> temp_copying_cache.dat
        echo $i $j $kn $kn_mf_pband $emf $emf_mf_pband
    fi
}


for ((i=3;i<=$(($kmax*$nband+2)); i++)); do
    kn=$(awk -v i=$i 'NR==i''{ print $9 }'  temp_bandstructure.dat)
    emf=$(awk -v i=$i 'NR==i''{ print $10 }'  temp_bandstructure.dat)
    
    for j in {$(($nheader+1))..$(($kmax*$nband+$nheader))}; do
        ((i=i%20)); ((i++==0)) && wait
        task "$j" &
    done
done
wait

Can anyone tell me why the tasks are not running and, more importantly, how I can get them to run correctly?

Answer 1

The problem is in

for j in {$(($nheader+1))..$(($kmax*$nband+$nheader))}; do
    ((i=i%20)); ((i++==0)) && wait
    task "$j" &
done

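Here, the brace expansion {$(($nheader+1))..$(($kmax*$nband+$nheader))} does not expand to a list of numbers. Bash performs brace expansion before arithmetic expansion, so the result is a single literal string such as {47..1234} (the actual numbers depend on the contents of your files).
You then start task '{47..1234}' &, which does nothing: inside task you try to extract values with awk -v j='{47..1234}' 'NR==j', but NR is never equal to {47..1234}.
To fix this, use seq or for ((...; ...; ...)). See "How do I iterate over a range of numbers defined by variables in Bash?".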
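A minimal sketch of the difference, using hypothetical bounds start and end in place of the values computed from nheader, kmax and nband in the question:

```shell
#!/usr/bin/env bash
# Hypothetical bounds for illustration only; in the question they are
# derived from nheader, kmax and nband.
start=3
end=6

# Brace expansion runs before variable expansion, so this loop body
# executes exactly once, with j set to a literal string.
for j in {$start..$end}; do
    literal="$j"
done

# C-style loop: the bounds are evaluated arithmetically.
c_style=""
for ((j = start; j <= end; j++)); do
    c_style+="$j "
done

# seq prints the numbers, and the command substitution splits them into words.
with_seq=""
for j in $(seq "$start" "$end"); do
    with_seq+="$j "
done

printf 'brace:   %s\n' "$literal"
printf 'c-style: %s\n' "$c_style"
printf 'seq:     %s\n' "$with_seq"
```

The first line printed shows the unexpanded string {3..6}, while the other two loops iterate over 3, 4, 5 and 6.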

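In any case, your script is slow because you read the same files over and over again (and because you start a lot of processes). for i ...; do awk -v i=$i 'NR==i'; done has quadratic time complexity, since every iteration rescans the file from the top. You could try rewriting the script in awk to make it faster. First, read one of the files into arrays and keep it in memory, then process the other file.

Here is a skeleton of such an awk script. The idiom FNR==NR is true only while the first file is being processed.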

awk -v bandmax="$(tail -n1 bandstructure.dat | awk '{print $2}')" -v nheader=46 '
  FNR==NR && NR>nheader { kn_mf_pband[NR-nheader]=$1; em_mf_pband[NR-nheader]=$3 }
  FNR==NR { next }
  # because of the `next` the following rules are only processed for the 2nd file
  FNR==3 { bandmin=$2 }
  FNR>2 {
    # here you can use for loops to iterate over the stored values in
    # kn_mf_pband[...] and em_mf_pband[...]
  }
' mf_pband.dat bandstructure.dat
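
One possible way to fill in the skeleton is sketched below. It is a variation rather than the answerer's implementation: instead of indexing the stored rows by position, it groups them by kn so that each row of the second file is compared only against candidates with the same kn. The sample files, the workdir variable, the tol variable, and the array names count, energy and line are all made up for illustration. The sketch also assumes that every data row should be processed (it ignores kmax and nband) and that the second file is temp_bandstructure.dat with two header lines, as in the question.

```shell
#!/usr/bin/env bash
# Sketch only: tiny made-up sample files stand in for the real data.
workdir=$(mktemp -d)
nheader=2   # the real mf_pband.dat in the question has 46 header lines

cat > "$workdir/mf_pband.dat" <<'EOF'
# header 1
# header 2
  1     0.00000   -3.02167   0.998   0.000
  2     0.00850   -3.02625   0.997   0.001
  2     0.00850   -9.99999   0.500   0.500
EOF

cat > "$workdir/temp_bandstructure.dat" <<'EOF'
# spin band kx ky kz E(MF) E(QP) Delta E kn E(MF)5dp
# (Cartesian coordinates) (eV)
1 22 0.00000 0.00000 0.00000 -3.021665798 -4.022414204 -1.000748406 1 -3.02167
1 22 0.00850 0.00000 0.00000 -3.026245712 -4.027334803 -1.001089091 2 -3.02625
EOF

awk -v nheader="$nheader" -v tol=0.00001 '
  # First file (mf_pband.dat): keep the data rows in memory, grouped by kn.
  FNR == NR {
    if (FNR > nheader) {
      n = ++count[$1]
      energy[$1, n] = $3
      line[$1, n] = $0
    }
    next
  }
  # Second file: skip its two header lines, then print the first stored
  # row with the same kn whose energy lies within tol of column 10.
  FNR > 2 {
    kn = $9; emf = $10
    for (n = 1; n <= count[kn]; n++) {
      diff = emf - energy[kn, n]
      if (diff <= tol && diff >= -tol) {
        print line[kn, n]
        break
      }
    }
  }
' "$workdir/mf_pband.dat" "$workdir/temp_bandstructure.dat" > "$workdir/temp_copying_cache.dat"

cat "$workdir/temp_copying_cache.dat"
```

Each file is read exactly once, and the break mirrors the question's loop, which stops at the first matching row. As in the question's [ "$kn" = "$kn_mf_pband" ] test, kn is compared as text, so 1 and 01 would not match.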

parallel-processing

bash

for-loop