Extracting columns by names (matching patterns) in bash

Coding,Value,Meaning,54-1.0,54-2.0,431-2.0,212-0.0,212-1.0
1,1,Yes,0.4,0.3,0.7,0.1,0.6
2,0,Other job (free text entry),0,0.7,0.3,0.7,0.8
2,1,Managers and Senior Officials,0.5,0.2,0.4,0.7,0.7
2,11,Corporate Managers,0.1,0.7,0.4,0.2,0.4
2,111,Corporate Managers And Senior Officials,0,0.8,0.8,0.4,0.8
2,1111,Senior officials in national government,0.9,0.6,0.4,0.2,0.9
2,1111001,AM (National Assembly),0.8,0.3,0.2,0,0.2
2,1111002,Ambassador (Foreign and Commonwealth Office),0.9,0.9,0.7,0.1,0.2
2,1111003,Band 0 (Health and Safety Executive),0.6,0.4,0,0.4,0.8
2,1111004,Band 1B (Meteorological Office),0.6,0.1,0.6,1,0.8

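I have a csv.gz file like the one above. I want to extract columns by name, keeping those whose names match certain strings, for example column names matching "54-" and "212-".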

I found the solution below, but I wonder whether it can be modified to extract the columns that match any element of a list of strings, e.g. "Meaning", "54-", "212-".

zcat test.csv.gz |awk -F, 'NR==1{for(i=1;i<=NF;i++)if($i~/54-/)f[n++]=i}{for(i=0;i<n;i++)printf"%s%s",i?" ":"",$f[i];print""}' 

I would also like to save the result to a csv.gz file. But when I add > outputfile.csv at the end, the output is not comma-separated. Where in this command should I put OFS=","?

The expected output is as follows (in a csv.gz file):

Meaning,54-1.0,54-2.0,212-0.0,212-1.0
Yes,0.4,0.3,0.1,0.6
Other job (free text entry),0,0.7,0.7,0.8
Managers and Senior Officials,0.5,0.2,0.7,0.7
Corporate Managers,0.1,0.7,0.2,0.4
Corporate Managers And Senior Officials,0,0.8,0.4,0.8
Senior officials in national government,0.9,0.6,0.2,0.9
AM (National Assembly),0.8,0.3,0,0.2
Ambassador (Foreign and Commonwealth Office),0.9,0.9,0.1,0.2
Band 0 (Health and Safety Executive),0.6,0.4,0.4,0.8
Band 1B (Meteorological Office),0.6,0.1,1,0.8

Thanks.

Hope this helps. Change the variable get to suit your needs.

One-liner:

$ awk -v get='^(Meaning|54-|212-)' 'BEGIN{FS=OFS=","}FNR==1{for(i=1;i<=NF;i++)if($i~get)cols[++c]=i}{for(i=1; i<=c; i++)printf "%s%s", $(cols[i]), (i<c ? OFS : ORS)}' file
Meaning,54-1.0,54-2.0,212-0.0,212-1.0
Yes,0.4,0.3,0.1,0.6
Other job (free text entry),0,0.7,0.7,0.8
Managers and Senior Officials,0.5,0.2,0.7,0.7
Corporate Managers,0.1,0.7,0.2,0.4
Corporate Managers And Senior Officials,0,0.8,0.4,0.8
Senior officials in national government,0.9,0.6,0.2,0.9
AM (National Assembly),0.8,0.3,0,0.2
Ambassador (Foreign and Commonwealth Office),0.9,0.9,0.1,0.2
Band 0 (Health and Safety Executive),0.6,0.4,0.4,0.8
Band 1B (Meteorological Office),0.6,0.1,1,0.8

In your case:

$ zcat test.csv.gz | awk -v get='^(Meaning|54-|212-)' 'BEGIN{FS=OFS=","}FNR==1{for(i=1;i<=NF;i++)if($i~get)cols[++c]=i}{for(i=1; i<=c; i++)printf "%s%s", $(cols[i]), (i<c ? OFS : ORS)}'
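The question also asks for the result to be saved as a csv.gz file. A minimal sketch of one way to do that is to pipe the awk output back through gzip. The file names test.csv.gz and outputfile.csv.gz are taken from the question, the three-line sample is only there to make the sketch self-contained, and gzip -dc is used here as a portable equivalent of zcat:

```shell
# Build a small gzipped sample so the sketch can be run on its own.
printf '%s\n' \
  'Coding,Value,Meaning,54-1.0,54-2.0,431-2.0,212-0.0,212-1.0' \
  '1,1,Yes,0.4,0.3,0.7,0.1,0.6' \
  '2,0,Other job (free text entry),0,0.7,0.3,0.7,0.8' | gzip > test.csv.gz

# Decompress, keep the matching columns, and recompress the result.
# FS and OFS are both set to "," in BEGIN, so the output stays comma-separated.
gzip -dc test.csv.gz |
  awk -v get='^(Meaning|54-|212-)' '
    BEGIN { FS = OFS = "," }
    FNR == 1 { for (i = 1; i <= NF; i++) if ($i ~ get) cols[++c] = i }
    { for (i = 1; i <= c; i++) printf "%s%s", $(cols[i]), (i < c ? OFS : ORS) }
  ' | gzip > outputfile.csv.gz

# Inspect the compressed result.
gzip -dc outputfile.csv.gz
```

Redirecting with > outputfile.csv alone would write uncompressed text, so the final gzip stage is what produces the .gz file.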

More readable:

awk -v get='^(Meaning|54-|212-)' '
         BEGIN{
             FS=OFS=","
         }
         FNR==1{
               for(i=1;i<=NF;i++)
                   if($i~get)cols[++c]=i
         }
         {
           for(i=1; i<=c; i++)
                printf "%s%s", $(cols[i]), (i<c ? OFS : ORS)
         }' file
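If the names are easier to maintain as a plain list than as a regular expression, a possible variation is to pass a comma-separated list and compare each header against it as a literal prefix. This is only a sketch under that assumption: the variable name names, the array want, and the output file selected.csv are hypothetical and not part of the answer above, and index() is used so that characters such as "." or "(" in a name are not interpreted as regex syntax.

```shell
# Sample data with the same header as the question, written to "file".
printf '%s\n' \
  'Coding,Value,Meaning,54-1.0,54-2.0,431-2.0,212-0.0,212-1.0' \
  '1,1,Yes,0.4,0.3,0.7,0.1,0.6' > file

# names (hypothetical) holds a comma-separated list of literal prefixes.
awk -v names='Meaning,54-,212-' '
  BEGIN { FS = OFS = ","; n = split(names, want, ",") }
  FNR == 1 {
    for (i = 1; i <= NF; i++)            # walk the header fields in order
      for (j = 1; j <= n; j++)
        if (index($i, want[j]) == 1) {   # header starts with this prefix
          cols[++c] = i
          break                          # keep each column only once
        }
  }
  { for (i = 1; i <= c; i++) printf "%s%s", $(cols[i]), (i < c ? OFS : ORS) }
' file > selected.csv

cat selected.csv
```

Because the outer loop runs over the header fields, the selected columns keep their original order regardless of the order of the names in the list.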

Input:

$ cat file
Coding,Value,Meaning,54-1.0,54-2.0,431-2.0,212-0.0,212-1.0
1,1,Yes,0.4,0.3,0.7,0.1,0.6
2,0,Other job (free text entry),0,0.7,0.3,0.7,0.8
2,1,Managers and Senior Officials,0.5,0.2,0.4,0.7,0.7
2,11,Corporate Managers,0.1,0.7,0.4,0.2,0.4
2,111,Corporate Managers And Senior Officials,0,0.8,0.8,0.4,0.8
2,1111,Senior officials in national government,0.9,0.6,0.4,0.2,0.9
2,1111001,AM (National Assembly),0.8,0.3,0.2,0,0.2
2,1111002,Ambassador (Foreign and Commonwealth Office),0.9,0.9,0.7,0.1,0.2
2,1111003,Band 0 (Health and Safety Executive),0.6,0.4,0,0.4,0.8
2,1111004,Band 1B (Meteorological Office),0.6,0.1,0.6,1,0.8

Here is an awk script with explanations.

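Note line 3, which defines the list of field identifiers in the fieldsIdentifierList variable; you can modify it there. Alternatively, remove that assignment and pass the value in from outside with the -v command-line option.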

script.awk

BEGIN { # pre-process: set initial values
  FS = OFS = ","; # split input and join output on ","
  fieldsIdentifierList = "54-,212-,Meaning"; # list of field identifiers
  identifierCount = split(fieldsIdentifierList, fieldsIdentifierArr, ","); # create an array from the field identifiers
}
NR == 1 { # process only the first (header) line
  for (i = 1; i <= NF; i++) # for each field
    for (j = 1; j <= identifierCount; j++) # and for each field identifier
      if ($i ~ fieldsIdentifierArr[j]) { # if the field matches the identifier
        targetFieldsArr[++n] = i; # append the field index to the target fields array
        break; # do not add the same field twice
      }
}
{ # for each line
  for (k = 1; k <= n; k++) # for each target field, in column order
    printf("%s%s", (k > 1 ? OFS : ""), $(targetFieldsArr[k])); # print the separator (except before the first field), then the field
  print ""; # print end of line
}

Running script.awk

zcat test.csv.gz | awk -f script.awk

Sample output

$ awk -f script.awk input.txt
Meaning,54-1.0,54-2.0,212-0.0,212-1.0
Yes,0.4,0.3,0.1,0.6
Other job (free text entry),0,0.7,0.7,0.8
Managers and Senior Officials,0.5,0.2,0.7,0.7
Corporate Managers,0.1,0.7,0.2,0.4
Corporate Managers And Senior Officials,0,0.8,0.4,0.8
Senior officials in national government,0.9,0.6,0.2,0.9
AM (National Assembly),0.8,0.3,0,0.2
Ambassador (Foreign and Commonwealth Office),0.9,0.9,0.1,0.2
Band 0 (Health and Safety Executive),0.6,0.4,0.4,0.8
Band 1B (Meteorological Office),0.6,0.1,1,0.8

@嘟嘟小子, I got the following result with the script above:

Coding  Value   Meaning 54-1.0  54-2.0  431-2.0 212-0.0 212-1.0 
    Coding  Value   Meaning 54-1.0  54-2.0  431-2.0 212-0.0 212-1.0
    Coding  Value   Meaning 54-1.0  54-2.0  431-2.0 212-0.0 212-1.0
1   1   Yes 0.4 0.3 0.7 0.1 0.6 
    1   1   Yes 0.4 0.3 0.7 0.1 0.6
    1   1   Yes 0.4 0.3 0.7 0.1 0.6
2   0   Other   2   0   Other   2   0   Other
2   1   Managers    2   1   Managers    2   1   Managers
2   11  Corporate   2   11  Corporate   2   11  Corporate
2   111 Corporate   2   111 Corporate   2   111 Corporate
2   1111    Senior  2   1111    Senior  2   1111    Senior
2   1111001 AM  2   1111001 AM  2   1111001 AM
2   1111002 Ambassador  2   1111002 Ambassador  2   1111002 Ambassador
2   1111003 Band    2   1111003 Band    2   1111003 Band
2   1111004 Band    2   1111004 Band    2   1111004 Band