AWK: operations on multi-column data files

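I am analysing multiple data files in multi-column format using the following AWK code (which does all the statistical calculations), integrated into a bash script (which handles the data files):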

#!/bin/bash
home="$PWD"
# folder with the outputs
rescore="${home}"/rescore 
# folder with the folders to analyse
storage="${home}"/results
#cd "${home}"/results
cd "${storage}"
csv_pattern='*_filt.csv'


while read -r d; do
awk '
FNR==1 {
   if (n) {                     # calculate the results of previous file
      m = s / n                 # mean
      mean[suffix] = m          # store the mean in an array
      lowest[suffix] = min      # lowest value of dG; corresponds to the first row of the original CSV
   }
   prefix=suffix=FILENAME
   sub(/_.*/, "", prefix)
   sub(/\/[^\/]+$/, "", suffix)
   sub(/^.*_/, "", suffix)
   s = 0                        # sum of $3
   s2 = 0                       # sum of $3 ** 2
   n = 0                        # count of samples
   min = 0                      # lowest value of $3 (assumes all $3 < 0)
}
FNR > 1 {
   s += $3
   s2 += $3 * $3
   ++n
   if ($3 < min) min = $3       # update the lowest value
}
END {
   if (n) {                     # just to avoid division by zero
      m = s / n
      mean[suffix] = m          # store the mean of the last file
      lowest[suffix] = min
   }
   print "Lig(CNE)", "dG(mean)", "dG(min)"
   for (i in mean)
      printf "%s %.2f %.2f\n", i, mean[i], lowest[i]
}'  "${d}_"*/${str} > "${rescore}/${str_name}/"${d%%_*}".csv"
done < <(find . -maxdepth 1 -type d -name '*_*_*' | awk -F '[_/]' '!seen[$2]++ {print $2}')

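Basically, on each iteration of the loop the script computes, for every CSV file, the mean of the numbers in the third column (dG) and detects its minimum value (which always corresponds to the line with ID=1):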

# input *_filt.csv located in the folder 10V1_cne_lig1001
ID, POP, dG
1, 142, -5.6500 # this is dG min
2, 10, -5.5000
3, 2, -4.9500
4, 150, -4.1200 # this is pop(MAX)

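and saves the results in another multi-column output file (10 lines for 10 processed CSVs), containing part of the name of each processed CSV (the corresponding prefix, used as the ID of the line), its dG(mean) and its dG(min):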

# output.csv
Lig(CNE) dG(mean) dG(min)
lig1 -6.78 -7.23
lig2 -5.56 -5.76
lig3 -7.30 -8.69
lig4 -7.98 -8.60
lig5 -6.78 -7.16
lig6 -6.24 -6.50
lig7 -7.44 -8.01
lig8 -4.62 -5.60
lig9 -7.26 -7.48
lig10 -5.9 -6.03

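I need to add to the AWK part of the code the ability to detect, and print in a separate column, the value from $3 (dG) that belongs to the row with the maximum value in $2 (the pop column) of the initial CSV. In the example above, this dG value is -4.1200: the highest number in the second column is 150, which is on the 4th line of the CSV. So the aim is to add to output.csv a fourth column containing the $3 (dG) value that corresponds to the maximum value in $2 (pop).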

As for the awk part, would you please try:

awk -F ", *" '                  # set field separator to comma, followed by 0 or more whitespaces
FNR==1 {
   if (n) {                     # calculate the results of previous file
      m = s / n                 # mean
      mean[suffix] = m          # store the mean in an array
      lowest[suffix] = min      # lowest value of dG; corresponds to the first row of the original CSV
      highest[suffix] = fourth  # dG of highest pop
   }
   prefix=suffix=FILENAME
   sub(/_.*/, "", prefix)
   sub(/\/[^\/]+$/, "", suffix)
   sub(/^.*_/, "", suffix)
   s = 0                        # sum of $3
   s2 = 0                       # sum of $3 ** 2
   n = 0                        # count of samples
   min = 0                      # lowest value of $3 (assuming all $3 < 0)
   max = 0                      # highest value of $2 (assuming all $2 > 0)
}
FNR > 1 {
   s += $3
   s2 += $3 * $3
   ++n
   if ($3 < min) min = $3       # update the lowest value
   if ($2 > max) {
      max = $2                  # update the highest value
      fourth = $3               # to be printed in the fourth column
   }
}
END {
   if (n) {                     # just to avoid division by zero
      m = s / n
      mean[suffix] = m          # store the mean in an array
      lowest[suffix] = min
      highest[suffix] = fourth  # dG of highest pop
   }
   print "Lig(CNE)", "dG(mean)", "dG(min)", "dG(highest pop)"
   for (i in mean)
      printf "%s %.2f %.2f %.2f\n", i, mean[i], lowest[i], highest[i]
}' *_filt.csv
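  • Setting the field separator to a comma is important. With the default whitespace separator, $2 would be e.g. "142," (with the trailing comma), which awk compares as a string, so the numeric comparison may break.
  • I have kept some unused variables (such as s2), which may be useful for your future updates.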
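
To illustrate the point about the field separator, here is a minimal sketch that runs the same "dG of the highest pop" logic on the sample data from the question, once with -F ', *' and once with the default separator. The names sample, prog, best, with_fs and without_fs are made up for this illustration, the trailing "# ..." annotations have been removed from the sample, and, as in the code above, all pop values are assumed to be positive.

```shell
#!/bin/bash
# Sample rows from the question (annotations stripped); hypothetical variable name.
sample='ID, POP, dG
1, 142, -5.6500
2, 10, -5.5000
3, 2, -4.9500
4, 150, -4.1200'

# Same awk program for both runs: remember $3 of the row with the largest $2.
prog='BEGIN { max = 0 }
FNR > 1 && $2 > max { max = $2; best = $3 }
END { printf "%s %.2f\n", max, best }'

# With the comma separator, $2 is a clean number ("142"), so $2 > max is numeric.
with_fs=$(printf '%s\n' "$sample" | awk -F ', *' "$prog")

# With the default separator, $2 is "142," which does not look numeric,
# so awk falls back to string comparison and picks the wrong row.
without_fs=$(printf '%s\n' "$sample" | awk "$prog")

echo "with -F:    $with_fs"
echo "without -F: $without_fs"
```

With the comma separator this should report 150 -4.12, matching the expected fourth column. Without it, the comparison is done on strings, "2," sorts after "142," and "150,", and the sketch reports 2, -4.95 instead.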