AWK: operations on multi-column data files
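I am analysing several data files in multi-column format using the following AWK code (which does all the statistical calculations), integrated into a bash script (which handles the data files):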
#!/bin/bash
home="$PWD"
# folder with the outputs
rescore="${home}"/rescore
# folder with the folders to analyse
storage="${home}"/results
#cd "${home}"/results
cd "${storage}"
csv_pattern='*_filt.csv'
while read -r d; do
  awk '
    FNR==1 {
      if (n) {                  # calculate the results of previous file
        m = s / n               # mean
        mean[suffix] = m        # store the mean in an array
        lowest[suffix] = min    # lowest value of dG - corresponds to the top row in the original CSV
      }
      prefix=suffix=FILENAME
      sub(/_.*/, "", prefix)
      sub(/\/[^\/]+$/, "", suffix)
      sub(/^.*_/, "", suffix)
      s = 0                     # sum of $3
      s2 = 0                    # sum of $3 ** 2
      n = 0                     # count of samples
      min = 0                   # upper bound for the minimum of $3 (dG values are negative)
    }
    FNR > 1 {
      s += $3
      s2 += $3 * $3
      ++n
      if ($3 < min) min = $3    # update the lowest value
    }
    END {
      if (n) {                  # just to avoid division by zero
        m = s / n
        mean[suffix] = m        # store the mean of the last file as well
        lowest[suffix] = min
      }
      print "Lig(CNE)", "dG(mean)", "dG(min)"
      for (i in mean)
        printf "%s %.2f %.2f\n", i, mean[i], lowest[i]
    }' "${d}_"*/${str} > "${rescore}/${str_name}/"${d%%_*}".csv"
done < <(find . -maxdepth 1 -type d -name '*_*_*' | awk -F '[_/]' '!seen[$2]++ {print $2}')
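Basically, as the loop runs, the script computes, for each CSV file, the mean of the numbers in the third column (dG) and detects its minimum value (which always corresponds to the row with ID=1):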
# input *_filt.csv located in the folder 10V1_cne_lig1001
ID, POP, dG
1, 142, -5.6500 # this is dG min
2, 10, -5.5000
3, 2, -4.9500
4, 150, -4.1200 # this is pop(MAX)
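It then saves the results in another multi-column output file (10 lines for 10 processed CSVs), containing a part of the name of each processed CSV (the corresponding name fragment is used as the row ID), its dG(mean) and its dG(min):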
# output.csv
Lig(CNE) dG(mean) dG(min)
lig1 -6.78 -7.23
lig2 -5.56 -5.76
lig3 -7.30 -8.69
lig4 -7.98 -8.60
lig5 -6.78 -7.16
lig6 -6.24 -6.50
lig7 -7.44 -8.01
lig8 -4.62 -5.60
lig9 -7.26 -7.48
lig10 -5.9 -6.03
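The row ID in the output comes from the three sub() calls applied to FILENAME in the script. A minimal sketch of how they reduce a path, assuming a hypothetical file name example_filt.csv inside the folder 10V1_cne_lig1001 (only the folder name comes from the sample above; split_name is an illustrative helper, not part of the script):

```shell
#!/bin/bash
# Hypothetical path: only the folder name is taken from the question.
path='10V1_cne_lig1001/example_filt.csv'

split_name() {
    # Apply the same three sub() calls as the script to a path string.
    awk -v f="$1" 'BEGIN {
        prefix = suffix = f
        sub(/_.*/, "", prefix)          # drop everything from the first underscore on
        sub(/\/[^\/]+$/, "", suffix)    # drop the file name, keeping the folder
        sub(/^.*_/, "", suffix)         # drop everything up to the last underscore
        print prefix, suffix
    }'
}

split_name "$path"    # prints: 10V1 lig1001
```

Note that prefix is computed but never used by the awk code; only suffix ends up as the array key.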
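I need to add to the AWK part of my code the ability to detect, and print in a separate column, the value from $3 (dG) that corresponds to the maximum value in $2 (the pop column) of the input CSV. In the example above this dG value is -4.1200: it belongs to row 4 of the CSV, where the highest number (150) is found in the second column. So the aim is to print a fourth column in output.csv containing the $3 (dG) value that corresponds to the maximum of $2 (pop).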
As for the awk part, would you please try:
awk -F ", *" '                  # set field separator to comma, followed by 0 or more whitespaces
  FNR==1 {
    if (n) {                    # calculate the results of previous file
      m = s / n                 # mean
      mean[suffix] = m          # store the mean in an array
      lowest[suffix] = min      # lowest value of dG - corresponds to the top row in the original CSV
      highest[suffix] = fourth  # dG of highest pop
    }
    prefix=suffix=FILENAME
    sub(/_.*/, "", prefix)
    sub(/\/[^\/]+$/, "", suffix)
    sub(/^.*_/, "", suffix)
    s = 0                       # sum of $3
    s2 = 0                      # sum of $3 ** 2
    n = 0                       # count of samples
    min = 0                     # lowest value of $3 (assuming all $3 < 0)
    max = 0                     # highest value of $2 (assuming all $2 > 0)
  }
  FNR > 1 {
    s += $3
    s2 += $3 * $3
    ++n
    if ($3 < min) min = $3      # update the lowest value
    if ($2 > max) {
      max = $2                  # update the highest value
      fourth = $3               # to be printed in the fourth column
    }
  }
  END {
    if (n) {                    # just to avoid division by zero
      m = s / n
      mean[suffix] = m          # store the mean in an array
      lowest[suffix] = min
      highest[suffix] = fourth  # dG of highest pop
    }
    print "Lig(CNE)", "dG(mean)", "dG(min)", "dG(highest pop)"
    for (i in mean)
      printf "%s %.2f %.2f %.2f\n", i, mean[i], lowest[i], highest[i]
  }' *_filt.csv
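As a quick sanity check, the min / highest-pop logic can be tried on the sample input from the question. This is a reduced sketch rather than the full script: dg_stats is a hypothetical helper name, it handles a single file, and it seeds min and max from the first data row instead of from 0, so it does not rely on the sign assumptions noted in the comments above.

```shell
#!/bin/bash
# Recreate the sample input from the question in a temporary file.
tmp="$(mktemp)"
cat > "$tmp" <<'EOF'
ID, POP, dG
1, 142, -5.6500
2, 10, -5.5000
3, 2, -4.9500
4, 150, -4.1200
EOF

dg_stats() {
    # Print the minimum dG and the dG of the row with the highest POP.
    awk -F ', *' '
    FNR > 1 {
        ++n
        if (n == 1 || $3 < min) min = $3                      # running minimum of dG
        if (n == 1 || $2 > max) { max = $2; fourth = $3 }     # dG at the highest pop
    }
    END { if (n) printf "%.2f %.2f\n", min, fourth }' "$1"
}

dg_stats "$tmp"    # prints: -5.65 -4.12
rm -f "$tmp"
```

The second number is the dG of row 4 (pop 150), which is the value the fourth output column is meant to hold.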
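- Setting the field separator to a comma (here ", *") is important. Otherwise the numeric comparison may break: with the default whitespace splitting, $2 keeps its trailing comma (e.g. 150,) and is compared as a string.
- I have kept some unused variables (e.g. s2), which may be useful for your future updates.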
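To illustrate the first note, here is a small sketch showing what $2 looks like with and without the custom separator, and how a comparison against a number behaves in each case. The threshold 1000 and the helper names are arbitrary choices for the demonstration.

```shell
#!/bin/bash
line='4, 150, -4.1200'

with_default_fs() {
    # Default whitespace splitting: $2 is "150," and is compared as a string.
    printf '%s\n' "$1" | awk '{ print "[" $2 "]", ($2 > 1000) }'
}

with_comma_fs() {
    # -F ", *" consumes the comma and any following spaces: $2 is the number 150.
    printf '%s\n' "$1" | awk -F ', *' '{ print "[" $2 "]", ($2 > 1000) }'
}

with_default_fs "$line"    # prints: [150,] 1
with_comma_fs "$line"      # prints: [150] 0
```

In the first case the string "150," sorts after "1000" character by character, so the test is true even though 150 is numerically smaller than 1000.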