How can I assign each column value to its name?

Question

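I have a MetaData.csv file that contains many values used to run an analysis. What I want is:
1- Read the column names and create variables named after the columns.
2- Put the value in each column into its variable (as integers), so that other commands can read it: column_name=Its_value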

MetaData.csv:

MAF,HWE,Geno_Missing,Inds_Missing
0.05,1E-06,0.01,0.01

I wrote the following code, but it does not work well:

#!/bin/bash
Col_Names=$(head -n 1 MetaData.csv) # Cut header (comma sep)
Col_Names=$(echo ${Col_Names//,/ }) # Convert header to space sep
Col_Names=($Col_Names) # Convert header to an array 

for i in $(seq 1 ${#Col_Names[@]}); do
N="$(head -1 MetaData.csv | tr ',' '\n' | nl | grep -w "${Col_Names[$i]}" | tr -d " " | awk -F " " '{print $1}')";
${Col_Names[$i]}="$(cat MetaData.csv | cut -d"," -f$N | sed '1d')";
done

Output:

HWE=1E-06: command not found
Geno_Missing=0.01: command not found
Inds_Missing=0.01: command not found
cut: 2: No such file or directory
cut: 3: No such file or directory
cut: 4: No such file or directory
=: command not found

Expected output:

MAF=0.05
HWE=1E-06
Geno_Missing=0.01
Inds_Missing=0.01

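Problems:

1- I want to use the array length (${#Col_Names[@]}) as the last iteration, which is 5, but array indices start at 0 (0-4), so the MAF column is not captured by the loop. The loop also iterates twice (once over 0-4 and once over 2-4!).
2- When I try to read the values from the variables (echo $MAF), they are empty!

Any solution is greatly appreciated.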

Answer 1

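I really don't think you can implement a robust CSV reader/parser in Bash, but you can implement one that works, to some extent, with simple CSV files. For example, a very simple Bash CSV reader might look like this: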

#!/bin/bash

set -e

ROW_NUMBER='0'
HEADERS=()
while IFS=',' read -ra ROW; do
    if test "$ROW_NUMBER" == '0'; then
        for (( I = 0; I < ${#ROW[@]}; I++ )); do
            HEADERS["$I"]="${ROW[I]}"
        done
    else
        declare -A DATA_ROW_MAP
        for (( I = 0; I < ${#ROW[@]}; I++ )); do
            DATA_ROW_MAP[${HEADERS["$I"]}]="${ROW[I]}"
        done
# DEMO {
        echo -e "${DATA_ROW_MAP['Fnames']}\t${DATA_ROW_MAP['Inds_Missing']}"
# } DEMO
        unset DATA_ROW_MAP
    fi
    ROW_NUMBER=$((ROW_NUMBER + 1))
done

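Note that it has several shortcomings:

it only works with ,-separated fields (true "C"SV);
it cannot handle multi-line records;
it cannot handle field escaping;
it always treats the first line as the header row.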
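
To make the field-escaping limitation concrete, here is a small sketch. The sample record is made up for illustration (it is not from the asker's file); it uses the same IFS=',' read -ra splitting as the reader above on a quoted field that contains a comma:

```shell
#!/bin/bash
# Hypothetical record: the second field is quoted and contains a comma.
line='sample1,"Smith, John",0.05'

# Same splitting technique as the reader above.
IFS=',' read -ra fields <<< "$line"

# A CSV-aware parser would report 3 fields; plain splitting reports 4.
echo "fields found: ${#fields[@]}"
printf '<%s>\n' "${fields[@]}"
```

The quoted field is cut in two and the quote characters stay in the data, which is why this approach is only suitable for simple files such as the one in the question.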

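This is why many commands can produce and consume NUL (\0)-delimited data: that control character cannot appear inside ordinary text fields, so it is easier to work with. As for external commands, test is the only candidate in this script, and since Bash provides test as a builtin, no external process should be started at all (the check could also be rewritten with case).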
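
As a rough sketch of what working with NUL-delimited data looks like in Bash (the three sample records are invented for illustration), read -d '' stops at each NUL byte, so commas and even newlines inside a record survive intact:

```shell
#!/bin/bash
# Collect NUL-delimited records into an array.
# -d '' makes read use the NUL byte as the record terminator.
items=()
while IFS= read -r -d '' item; do
    items+=("$item")
done < <(printf '%s\0' 'plain' 'with,comma' $'two\nlines')

echo "records read: ${#items[@]}"
```

Tools such as find -print0 and xargs -0 use the same convention.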

Usage example (with demo output):

./read-csv.sh < MetaData.csv

19.vcf.gz    0.01
20.vcf.gz
21.vcf.gz
22.vcf.gz

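I don't recommend using this parser at all; I would recommend a more CSV-oriented tool instead. Python is probably the easiest choice, or, if your preferred language is R, as you mentioned, that may be another option (see: Run R script from command line).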

Answer 2

If I understand your requirements correctly, would you please try something like this:

#!/bin/bash

nr=1                                    # initialize input line number to 1
while IFS=, read -r -a ary; do          # split the line on "," then assign "ary" to the fields
    if (( nr == 1 )); then              # handle the header line
        col_names=("${ary[@]}")         # assign column names
    else                                # handle the body lines
        for (( i = 0; i < ${#ary[@]}; i++ )); do
            printf -v "${col_names[i]}" '%s' "${ary[i]}"
                                        # assign the variable "${col_names[i]}" to the input field
        done
        # now you can access the values via its column name
        echo "Fnames=$Fnames"
        echo "MAF=$MAF"
        fname_list+=("$Fnames")         # create a list of Fnames
    fi
    (( nr++ ))                          # increment the input line number
done < MetaData.csv
echo "${fname_list[@]}"                 # print the list of Fnames

Output:

Fnames=19.vcf.gz
MAF=0.05
Fnames=20.vcf.gz
MAF=
Fnames=21.vcf.gz
MAF=
Fnames=22.vcf.gz
MAF=
19.vcf.gz 20.vcf.gz 21.vcf.gz 22.vcf.gz

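The statement IFS=, read -a ary is basically equivalent to your first three lines: it splits the input line on "," and assigns the resulting fields to the array variable ary.
There are several ways to use the value of a variable as a variable name (indirect variable references). printf -v VarName Value is one of them.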
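
For comparison, here is a minimal sketch of a few such techniques side by side. The helper variables name, value and ref are only illustrative, and the nameref form assumes Bash 4.3 or later:

```shell
#!/bin/bash
name='MAF'      # the variable name is held in another variable
value='0.05'

# 1) printf -v stores the formatted string in the variable named by $name.
printf -v "$name" '%s' "$value"
echo "printf -v: MAF=$MAF"

# 2) declare accepts "name=value" as a single quoted argument.
name='HWE'; value='1E-06'
declare "$name=$value"
echo "declare: HWE=$HWE"

# 3) A nameref (Bash 4.3+) makes ref an alias for the named variable.
declare -n ref='Geno_Missing'
ref='0.01'
unset -n ref    # remove the alias only, not Geno_Missing itself
echo "nameref: Geno_Missing=$Geno_Missing"

# Reading back indirectly: ${!name} expands the variable whose name is in $name.
name='MAF'
echo "indirect read: ${!name}"
```

Note that declare inside a function creates a local variable unless -g is given, so printf -v is usually the simpler choice when the assignment has to be visible to the rest of the script.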

[Edit]

Based on the OP's updated input file, here is another version:

#!/bin/bash

nr=1                                    # initialize input line number to 1
while IFS=, read -r -a ary; do          # split the line on "," then assign "ary" to the fields
    if (( nr == 1 )); then              # handle the header line
        col_names=("${ary[@]}")         # assign column names
    else                                # handle the body lines
        for (( i = 0; i < ${#ary[@]}; i++ )); do
            printf -v "${col_names[i]}" '%s' "${ary[i]}"
                                        # assign the variable "${col_names[i]}" to the input field
        done
    fi
    (( nr++ ))                          # increment the input line number
done < MetaData.csv

for n in "${col_names[@]}"; do          # iterate over the variable names
    echo "$n=${!n}"                     # print variable name and its value
done

# you can also specify the variable names literally as follows:
echo "MAF=$MAF HWE=$HWE Geno_Missing=$Geno_Missing Inds_Missing=$Inds_Missing"

Output:

MAF=0.05
HWE=1E-06
Geno_Missing=0.01
Inds_Missing=0.01
MAF=0.05 HWE=1E-06 Geno_Missing=0.01 Inds_Missing=0.01

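As for the output, the first four lines are printed by echo "$n=${!n}" and the last line by echo "MAF=$MAF ...". You can choose either statement depending on how you use the variables in the code that follows.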
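
Because the variable names come straight from the file, it may be worth checking them before assigning. The following is only a sketch under a few assumptions: the file has exactly one header line and one data row, as in the updated question; the temporary file stands in for MetaData.csv so that the example is self-contained; and the identifier check is my own addition, not part of the script above. Iterating over "${!col_names[@]}" also sidesteps the off-by-one problem from the question, since it expands to the actual indices starting at 0.

```shell
#!/bin/bash
# Sample input written to a temporary file so the sketch is self-contained.
csv=$(mktemp)
printf '%s\n' 'MAF,HWE,Geno_Missing,Inds_Missing' '0.05,1E-06,0.01,0.01' > "$csv"

{
    IFS=, read -r -a col_names      # first line: header
    IFS=, read -r -a values         # second line: the single data row
} < "$csv"
rm -f "$csv"

for i in "${!col_names[@]}"; do     # expands to the indices 0..n-1
    name=${col_names[i]}
    # Hypothetical safety check: skip anything that is not a valid identifier.
    if [[ ! $name =~ ^[A-Za-z_][A-Za-z0-9_]*$ ]]; then
        echo "skipping invalid column name: $name" >&2
        continue
    fi
    printf -v "$name" '%s' "${values[i]}"
done

echo "MAF=$MAF HWE=$HWE Geno_Missing=$Geno_Missing Inds_Missing=$Inds_Missing"
```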

Answer 3

This produces the expected output you posted from the sample input you posted:

$ awk -F, -v OFS='=' 'NR==1{split($0,hdr); next} {for (i=1;i<=NF;i++) print hdr[i], $i}' MetaData.csv
MAF=0.05
HWE=1E-06
Geno_Missing=0.01
Inds_Missing=0.01

If that's not all you need, please edit your question to clarify your requirements.
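
If the name=value pairs are needed as shell variables rather than printed text, one possible way to combine the awk command with the shell is sketched below. The temporary file stands in for MetaData.csv, and the loop variables name and value are illustrative. Process substitution is used instead of a pipe so that the assignments happen in the current shell and not in a subshell:

```shell
#!/bin/bash
# Sample input written to a temporary file so the sketch is self-contained.
csv=$(mktemp)
printf '%s\n' 'MAF,HWE,Geno_Missing,Inds_Missing' '0.05,1E-06,0.01,0.01' > "$csv"

# Read the name=value lines printed by awk and assign each one in this shell.
while IFS='=' read -r name value; do
    printf -v "$name" '%s' "$value"
done < <(awk -F, -v OFS='=' 'NR==1{split($0,hdr); next} {for (i=1;i<=NF;i++) print hdr[i], $i}' "$csv")
rm -f "$csv"

echo "MAF=$MAF"
echo "HWE=$HWE"
```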
