Extracting columns by name (matching patterns) in bash
Coding,Value,Meaning,54-1.0,54-2.0,431-2.0,212-0.0,212-1.0
1,1,Yes,0.4,0.3,0.7,0.1,0.6
2,0,Other job (free text entry),0,0.7,0.3,0.7,0.8
2,1,Managers and Senior Officials,0.5,0.2,0.4,0.7,0.7
2,11,Corporate Managers,0.1,0.7,0.4,0.2,0.4
2,111,Corporate Managers And Senior Officials,0,0.8,0.8,0.4,0.8
2,1111,Senior officials in national government,0.9,0.6,0.4,0.2,0.9
2,1111001,AM (National Assembly),0.8,0.3,0.2,0,0.2
2,1111002,Ambassador (Foreign and Commonwealth Office),0.9,0.9,0.7,0.1,0.2
2,1111003,Band 0 (Health and Safety Executive),0.6,0.4,0,0.4,0.8
2,1111004,Band 1B (Meteorological Office),0.6,0.1,0.6,1,0.8
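I have a csv.gz file like the one above. I want to extract columns by name, keeping those whose names match certain strings, for example column names matching "54-" and "212-".
I found the solution below, but I would like to know whether it can be modified to extract the columns that match any element of a list of strings, for example "Meaning", "54-", and "212-".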
zcat test.csv.gz |awk -F, 'NR==1{for(i=1;i<=NF;i++)if($i~/54-/)f[n++]=i}{for(i=0;i<n;i++)printf"%s%s",i?" ":"",$f[i];print""}'
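I would also like to save the result to a csv.gz file. However, when I append > outputfile.csv to the command, the output is not comma-separated. Where in this command should I put OFS=","?
The desired output (in the csv.gz file) is: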
Meaning,54-1.0,54-2.0,212-0.0,212-1.0
Yes,0.4,0.3,0.1,0.6
Other job (free text entry),0,0.7,0.7,0.8
Managers and Senior Officials,0.5,0.2,0.7,0.7
Corporate Managers,0.1,0.7,0.2,0.4
Corporate Managers And Senior Officials,0,0.8,0.4,0.8
Senior officials in national government,0.9,0.6,0.2,0.9
AM (National Assembly),0.8,0.3,0,0.2
Ambassador (Foreign and Commonwealth Office),0.9,0.9,0.1,0.2
Band 0 (Health and Safety Executive),0.6,0.4,0.4,0.8
Band 1B (Meteorological Office),0.6,0.1,1,0.8
Thanks.
Hopefully this helps. Change the variable get to match the columns you need:
One-liner:
$ awk -v get='^(Meaning|54-|212-)' 'BEGIN{FS=OFS=","}FNR==1{for(i=1;i<=NF;i++)if($i~get)cols[++c]=i}{for(i=1; i<=c; i++)printf "%s%s", $(cols[i]), (i<c ? OFS : ORS)}' file
Meaning,54-1.0,54-2.0,212-0.0,212-1.0
Yes,0.4,0.3,0.1,0.6
Other job (free text entry),0,0.7,0.7,0.8
Managers and Senior Officials,0.5,0.2,0.7,0.7
Corporate Managers,0.1,0.7,0.2,0.4
Corporate Managers And Senior Officials,0,0.8,0.4,0.8
Senior officials in national government,0.9,0.6,0.2,0.9
AM (National Assembly),0.8,0.3,0,0.2
Ambassador (Foreign and Commonwealth Office),0.9,0.9,0.1,0.2
Band 0 (Health and Safety Executive),0.6,0.4,0.4,0.8
Band 1B (Meteorological Office),0.6,0.1,1,0.8
In your case:
$ zcat test.csv.gz | awk -v get='^(Meaning|54-|212-)' 'BEGIN{FS=OFS=","}FNR==1{for(i=1;i<=NF;i++)if($i~get)cols[++c]=i}{for(i=1; i<=c; i++)printf "%s%s", $(cols[i]), (i<c ? OFS : ORS)}'
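The question also asks for the result to be saved as a csv.gz file. One way to do that, sketched below, is to pipe the awk output through gzip instead of redirecting it to a plain file. The file name sample.csv.gz is made up for this illustration so that the sketch does not overwrite a real test.csv.gz; its three rows are copied from the question's data.

```shell
# Build a small compressed sample so the pipeline can be tried in isolation.
# sample.csv.gz is a hypothetical name; the rows come from the question.
printf '%s\n' \
  'Coding,Value,Meaning,54-1.0,54-2.0,431-2.0,212-0.0,212-1.0' \
  '1,1,Yes,0.4,0.3,0.7,0.1,0.6' \
  '2,0,Other job (free text entry),0,0.7,0.3,0.7,0.8' | gzip > sample.csv.gz

# Decompress, keep the matching columns, and compress the result again.
# gzip -dc does the same job as zcat for .gz files.
gzip -dc sample.csv.gz |
  awk -v get='^(Meaning|54-|212-)' '
    BEGIN { FS = OFS = "," }
    FNR == 1 { for (i = 1; i <= NF; i++) if ($i ~ get) cols[++c] = i }
    { for (i = 1; i <= c; i++) printf "%s%s", $(cols[i]), (i < c ? OFS : ORS) }
  ' | gzip > outputfile.csv.gz

# Inspect the compressed result.
gzip -dc outputfile.csv.gz
```

With the sample rows above, the last command prints:
Meaning,54-1.0,54-2.0,212-0.0,212-1.0
Yes,0.4,0.3,0.1,0.6
Other job (free text entry),0,0.7,0.7,0.8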
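More readable:
awk -v get='^(Meaning|54-|212-)' '
BEGIN {
    FS = OFS = ","                      # read and write comma-separated fields
}
FNR == 1 {                              # header line: remember matching column numbers
    for (i = 1; i <= NF; i++)
        if ($i ~ get) cols[++c] = i
}
{                                       # every line, header included
    for (i = 1; i <= c; i++)
        printf "%s%s", $(cols[i]), (i < c ? OFS : ORS)
}' file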
Input:
$ cat file
Coding,Value,Meaning,54-1.0,54-2.0,431-2.0,212-0.0,212-1.0
1,1,Yes,0.4,0.3,0.7,0.1,0.6
2,0,Other job (free text entry),0,0.7,0.3,0.7,0.8
2,1,Managers and Senior Officials,0.5,0.2,0.4,0.7,0.7
2,11,Corporate Managers,0.1,0.7,0.4,0.2,0.4
2,111,Corporate Managers And Senior Officials,0,0.8,0.8,0.4,0.8
2,1111,Senior officials in national government,0.9,0.6,0.4,0.2,0.9
2,1111001,AM (National Assembly),0.8,0.3,0.2,0,0.2
2,1111002,Ambassador (Foreign and Commonwealth Office),0.9,0.9,0.7,0.1,0.2
2,1111003,Band 0 (Health and Safety Executive),0.6,0.4,0,0.4,0.8
2,1111004,Band 1B (Meteorological Office),0.6,0.1,0.6,1,0.8
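Here is an awk script with explanatory comments.
Note line 3, which defines the list of field identifiers in the variable fieldsIdentifierList; you can edit it.
Alternatively, remove that assignment and pass the list in with the -v command-line option.
script.awk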
BEGIN { # set initial values before any input is read
    FS = OFS = ","; # split input and join output on ","; without FS awk splits on whitespace
    fieldsIdentifierList = "54-,212-,Meaning"; # list of field identifiers (treated as regular expressions)
    identifierCount = split(fieldsIdentifierList, fieldsIdentifierArr, ","); # array of identifiers, indexed 1..identifierCount
}
NR == 1 { # header line only: find the target columns
    for (i = 1; i <= NF; i++) # for each field
        for (j = 1; j <= identifierCount; j++) # and for each field identifier
            if ($i ~ fieldsIdentifierArr[j]) { # if the column name matches the identifier
                targetFieldsArr[++n] = i; # append the field index to the target fields array
                break; # record each column once, even if several identifiers match
            }
}
{ # for every line, header included
    for (k = 1; k <= n; k++) # for each target field, in column order
        printf("%s%s", (k > 1 ? OFS : ""), $(targetFieldsArr[k])); # print a separator before every field except the first
    print ""; # end the line
}
Running script.awk:
zcat test.csv.gz | awk -f script.awk
Sample output:
$ awk -f script.awk input.txt
Meaning,54-1.0,54-2.0,212-0.0,212-1.0
Yes,0.4,0.3,0.1,0.6
Other job (free text entry),0,0.7,0.7,0.8
Managers and Senior Officials,0.5,0.2,0.7,0.7
Corporate Managers,0.1,0.7,0.2,0.4
Corporate Managers And Senior Officials,0,0.8,0.4,0.8
Senior officials in national government,0.9,0.6,0.2,0.9
AM (National Assembly),0.8,0.3,0,0.2
Ambassador (Foreign and Commonwealth Office),0.9,0.9,0.1,0.2
Band 0 (Health and Safety Executive),0.6,0.4,0.4,0.8
Band 1B (Meteorological Office),0.6,0.1,1,0.8
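To supply the identifier list with -v instead of editing the script, one option is to turn the comma-separated list into a single alternation regex inside BEGIN. This is only a sketch: the variable name ids and the file names select.awk and sample.csv are invented for the illustration, and the sample rows are copied from the question.

```shell
# Hypothetical sample input (rows copied from the question).
printf '%s\n' \
  'Coding,Value,Meaning,54-1.0,54-2.0,431-2.0,212-0.0,212-1.0' \
  '1,1,Yes,0.4,0.3,0.7,0.1,0.6' \
  '2,0,Other job (free text entry),0,0.7,0.3,0.7,0.8' > sample.csv

# select.awk: keep the columns whose header matches any identifier in "ids".
cat > select.awk <<'EOF'
BEGIN {
    FS = OFS = ","
    pattern = ids
    gsub(/,/, "|", pattern)           # "54-,212-" becomes "54-|212-"
    pattern = "(" pattern ")"
}
NR == 1 {                             # header: remember matching column numbers
    for (i = 1; i <= NF; i++)
        if ($i ~ pattern) cols[++n] = i
}
{                                     # every line: print the remembered columns
    for (k = 1; k <= n; k++)
        printf "%s%s", $(cols[k]), (k < n ? OFS : ORS)
}
EOF

# The list is given on the command line; no edit to the script is needed.
awk -v ids='Meaning,212-' -f select.awk sample.csv
```

With the sample rows above, this prints:
Meaning,212-0.0,212-1.0
Yes,0.1,0.6
Other job (free text entry),0.7,0.8

The identifiers are used as unanchored regular expressions, so an identifier such as 54- would also match a header like 154-1.0. Prefix the pattern with ^ if names must start with the identifier.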
@嘟嘟小子, this is the result I get with the script above:
Coding Value Meaning 54-1.0 54-2.0 431-2.0 212-0.0 212-1.0
Coding Value Meaning 54-1.0 54-2.0 431-2.0 212-0.0 212-1.0
Coding Value Meaning 54-1.0 54-2.0 431-2.0 212-0.0 212-1.0
1 1 Yes 0.4 0.3 0.7 0.1 0.6
1 1 Yes 0.4 0.3 0.7 0.1 0.6
1 1 Yes 0.4 0.3 0.7 0.1 0.6
2 0 Other 2 0 Other 2 0 Other
2 1 Managers 2 1 Managers 2 1 Managers
2 11 Corporate 2 11 Corporate 2 11 Corporate
2 111 Corporate 2 111 Corporate 2 111 Corporate
2 1111 Senior 2 1111 Senior 2 1111 Senior
2 1111001 AM 2 1111001 AM 2 1111001 AM
2 1111002 Ambassador 2 1111002 Ambassador 2 1111002 Ambassador
2 1111003 Band 2 1111003 Band 2 1111003 Band
2 1111004 Band 2 1111004 Band 2 1111004 Band
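Output of this shape, with each line repeated and the Meaning text cut off at the first space, is consistent with awk splitting fields on whitespace rather than on commas, which is what happens when FS is left at its default. The sketch below, using one row from the question, shows the difference; it is an illustration of the likely cause, not a confirmed diagnosis of the commenter's setup.

```shell
line='2,0,Other job (free text entry),0,0.7,0.3,0.7,0.8'

# Default FS: fields are split on runs of blanks, so commas stay inside fields.
printf '%s\n' "$line" | awk '{ print NF ": " $1 }'

# FS set to a comma: each CSV column becomes one field.
printf '%s\n' "$line" | awk -F, '{ print NF ": " $3 }'
```

The first command prints "5: 2,0,Other" and the second prints "8: Other job (free text entry)". Setting FS = "," in the BEGIN block, or passing -F, on the command line, makes the script see the eight CSV columns.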