AWK: Maintain field spacing like input file

I have reproduced my problem in the test file below:

# cat out 
2014-01-10 18:23:25          0 Andy/ADPTER/
2014-01-10 18:23:36        503 Sandy/ADPTER/ACCOUNTTYPE MAP.csv
2014-01-10 18:23:38        516 John/ADPTER/CITY MAP.csv
2014-01-10 18:23:38        398 Wendy/ADPTER/COUNTRY MAP.csv
2014-01-10 18:23:38      11117 Andy/ADPTER/CURRENCY MAP.csv
2014-01-10 18:23:38        260 Sandy/ADPTER/GENDER MAP.csv
2014-01-10 18:23:39        466 John/ADPTER/STATE MAP.csv
2014-01-10 18:23:40        373 Jim/ADPTER/UNITS MAP.csv

Here is my Bash variable:

# echo $bucket
bucket_name

So, in the file above, I want to prefix the 4th field with the value of the Bash variable.

This is my desired output:

2014-01-10 18:23:25          0 bucket_name/Andy/ADPTER/
2014-01-10 18:23:36        503 bucket_name/Sandy/ADPTER/ACCOUNTTYPE MAP.csv
2014-01-10 18:23:38        516 bucket_name/John/ADPTER/CITY MAP.csv
2014-01-10 18:23:38        398 bucket_name/Wendy/ADPTER/COUNTRY MAP.csv
2014-01-10 18:23:38      11117 bucket_name/Andy/ADPTER/CURRENCY MAP.csv
2014-01-10 18:23:38        260 bucket_name/Sandy/ADPTER/GENDER MAP.csv
2014-01-10 18:23:39        466 bucket_name/John/ADPTER/STATE MAP.csv
2014-01-10 18:23:40        373 bucket_name/Jim/ADPTER/UNITS MAP.csv

Here is what I tried:

# awk -v var=$bucket '{$4=var"/"$4; print}' out 
2014-01-10 18:23:25 0 bucket_name/Andy/ADPTER/
2014-01-10 18:23:36 503 bucket_name/Sandy/ADPTER/ACCOUNTTYPE MAP.csv
2014-01-10 18:23:38 516 bucket_name/John/ADPTER/CITY MAP.csv
2014-01-10 18:23:38 398 bucket_name/Wendy/ADPTER/COUNTRY MAP.csv
2014-01-10 18:23:38 11117 bucket_name/Andy/ADPTER/CURRENCY MAP.csv
2014-01-10 18:23:38 260 bucket_name/Sandy/ADPTER/GENDER MAP.csv
2014-01-10 18:23:39 466 bucket_name/John/ADPTER/STATE MAP.csv
2014-01-10 18:23:40 373 bucket_name/Jim/ADPTER/UNITS MAP.csv

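Problem:

My awk command does what I need, but it messes up the spacing between the fields (the delimiters?). My intention is just to prefix bucket_name/ to the 4th field and keep whatever spacing scheme the input file has (including right/left-aligned fields).

Here is another attempt: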

# awk -v var=$bucket 'BEGIN{OFS="\t"}{$4=var"/"$4; print}' out 
2014-01-10  18:23:25    0   bucket_name/Andy/ADPTER/
2014-01-10  18:23:36    503 bucket_name/Sandy/ADPTER/ACCOUNTTYPE    MAP.csv
2014-01-10  18:23:38    516 bucket_name/John/ADPTER/CITY    MAP.csv
2014-01-10  18:23:38    398 bucket_name/Wendy/ADPTER/COUNTRY    MAP.csv
2014-01-10  18:23:38    11117   bucket_name/Andy/ADPTER/CURRENCY    MAP.csv
2014-01-10  18:23:38    260 bucket_name/Sandy/ADPTER/GENDER MAP.csv
2014-01-10  18:23:39    466 bucket_name/John/ADPTER/STATE   MAP.csv
2014-01-10  18:23:40    373 bucket_name/Jim/ADPTER/UNITS    MAP.csv

But that did not help either.

Thanks.
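
Both attempts change the spacing for the same reason: as soon as any field is assigned, awk rebuilds $0 by joining the fields with OFS (a single space by default), so the original runs of blanks are discarded. A minimal demonstration, using made-up sample input and hypothetical function names:

```shell
# Assigning a field to itself is enough to trigger the rebuild of $0.
rebuild_demo() {
    printf 'a   b    c\n' | awk '{ $2 = $2; print }'
}

# With OFS set, the same rebuild joins the fields with tabs instead.
rebuild_demo_tab() {
    printf 'a   b    c\n' | awk 'BEGIN { OFS = "\t" } { $2 = $2; print }'
}

rebuild_demo        # prints: a b c
```

Keeping the input spacing therefore means editing $0 as a string instead of assigning to $4.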

You can use sed.

$ bucket='bucket_name'
$ sed "s~^\(\([^[:blank:]]\+[[:blank:]]\+\)\{3\}\)~\1$bucket/~" file
2014-01-10 18:23:25          0 bucket_name/Andy/ADPTER/
2014-01-10 18:23:36        503 bucket_name/Sandy/ADPTER/ACCOUNTTYPE MAP.csv
2014-01-10 18:23:38        516 bucket_name/John/ADPTER/CITY MAP.csv
2014-01-10 18:23:38        398 bucket_name/Wendy/ADPTER/COUNTRY MAP.csv
2014-01-10 18:23:38      11117 bucket_name/Andy/ADPTER/CURRENCY MAP.csv
2014-01-10 18:23:38        260 bucket_name/Sandy/ADPTER/GENDER MAP.csv
2014-01-10 18:23:39        466 bucket_name/John/ADPTER/STATE MAP.csv
2014-01-10 18:23:40        373 bucket_name/Jim/ADPTER/UNITS MAP.csv

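[[:blank:]]\+ is a POSIX character class that matches any horizontal whitespace character (space or tab) one or more times. [^[:blank:]]\+ is the negated POSIX character class; it matches any character that is not horizontal whitespace, one or more times.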
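
Note that \+ is a GNU extension to basic regular expressions. As a sketch of a more portable spelling, the same kind of substitution can be written with the POSIX interval \{1,\}. The function name is hypothetical, and the prefix is assumed to contain no ~, & or backslash:

```shell
# Insert "<prefix>/" after the first three blank-separated fields,
# leaving the original spacing untouched. Reads lines on stdin.
prefix_with_sed() {
    # $1: prefix, assumed free of "~", "&" and backslashes
    sed "s~^\(\([^[:blank:]]\{1,\}[[:blank:]]\{1,\}\)\{3\}\)~\1$1/~"
}

# Usage: prefix_with_sed "$bucket" < file
```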

You can use this awk:

bucket="bucket_name"
awk --re-interval -v b="$bucket" '{sub(/([^[:blank:]]+[[:blank:]]+){3}/, 
     "&" b "/")} 1' file
2014-01-10 18:23:25          0 bucket_name/Andy/ADPTER/
2014-01-10 18:23:36        503 bucket_name/Sandy/ADPTER/ACCOUNTTYPE MAP.csv
2014-01-10 18:23:38        516 bucket_name/John/ADPTER/CITY MAP.csv
2014-01-10 18:23:38        398 bucket_name/Wendy/ADPTER/COUNTRY MAP.csv
2014-01-10 18:23:38      11117 bucket_name/Andy/ADPTER/CURRENCY MAP.csv
2014-01-10 18:23:38        260 bucket_name/Sandy/ADPTER/GENDER MAP.csv
2014-01-10 18:23:39        466 bucket_name/John/ADPTER/STATE MAP.csv
2014-01-10 18:23:40        373 bucket_name/Jim/ADPTER/UNITS MAP.csv

-v b="$bucket"                 # pass a value to awk in variable b
--re-interval                  # Enable the use of interval
                               # expressions in regular expression matching
sub                            # match input using regex and substitute with
                               # the given string
([^[:blank:]]+[[:blank:]]+){3} # match first 3 fields of the line separated by space/tab
 "&" b "/"                     # replace by matched string + var b + /

EDIT: (thanks to @EdMorton) To make it work for any value of the argument (for example, try both solutions with bucket="&"), use:

awk --re-interval -v b="$bucket" 'match($0, /([^[:blank:]]+[[:blank:]]+){3}/) {
    $0 = substr($0, 1, RLENGTH) b "/" substr($0, RLENGTH+1) } 1' file
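
One remaining caveat: -v still processes backslash escapes in the value (for example, \n becomes a newline), and --re-interval is a gawk option. A hedged sketch of a variant that passes the prefix through the environment and avoids interval expressions; prefix_field4 and PREFIX are names made up for this example:

```shell
# Insert "<prefix>/" in front of the 4th field without touching the spacing.
# The prefix travels via ENVIRON, so awk never interprets it.
prefix_field4() {
    # $1: prefix; reads lines on stdin
    PREFIX="$1" awk '
    {
        rest = $0
        head = ""
        # Peel off three fields, each together with the blanks after it.
        for (i = 1; i <= 3; i++) {
            if (!match(rest, /^[ \t]*[^ \t]+[ \t]+/))
                break
            head = head substr(rest, 1, RLENGTH)
            rest = substr(rest, RLENGTH + 1)
        }
        if (i > 3)
            print head ENVIRON["PREFIX"] "/" rest
        else
            print $0    # fewer than four fields: leave the line alone
    }'
}

# Usage: prefix_field4 "$bucket" < file
```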

You tagged Perl in the OP, so here is a Perl solution:

perl -pe'BEGIN{$var=shift}s,(?:.*?\s+){3}\K,$var/,' "$bucket" out

Technically it is the same solution as the sed one, but it has the advantage of avoiding escaping problems. The shell variable $bucket can contain anything.

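This is a bit tricky in awk, but there is a relevant GNU extension: in gawk, the split function takes an optional fourth argument that stores the actual field separators for later use. Using that: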

gawk -v bucket="$bucket" '{ split($0, f, FS, d); d[NF] = ORS; f[4] = bucket "/" f[4]; for(i = 1; i <= NF; ++i) printf("%s%s", f[i], d[i]); }' filename

That is:

{
  split($0, f, FS, d)             # split line into fields, saving fields in
                                  # the f and delimiters in the d array
  d[NF] = ORS                     # for the newline at the end
  f[4] = bucket "/" f[4]          # fix fourth field
  for(i = 1; i <= NF; ++i) {      # then print the fields separated by the
    printf("%s%s", f[i], d[i]);   # saved delimiters
  }
}

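Addendum: I really don't recommend doing this with sed unless the variable comes from a trusted source and is guaranteed not to contain metacharacters (otherwise you have a code injection problem). That said, a simple way to do it with sed is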

sed "s|[[:space:]]\+|&${bucket}/|3" filename

...which appends ${bucket}/ to the third run of whitespace matched by [[:space:]]\+
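
If the value cannot be trusted, one possible mitigation is to escape it before it reaches the sed replacement. The following is only a sketch: the function name is hypothetical, it assumes the prefix contains no newline, and it writes the whitespace pattern as [[:space:]][[:space:]]* to avoid the GNU-only \+:

```shell
# Escape the characters that are special in the replacement of an
# s|...|...| command (backslash, "&" and the "|" delimiter), then insert
# "<prefix>/" after the third run of whitespace. Reads lines on stdin.
prefix_with_sed_escaped() {
    # $1: prefix, assumed to contain no newline
    esc=$(printf '%s\n' "$1" | sed 's/[\\&|]/\\&/g')
    sed "s|[[:space:]][[:space:]]*|&${esc}/|3"
}

# Usage: prefix_with_sed_escaped "$bucket" < filename
```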

If you want to stick with awk, giving the format string explicitly is probably the simplest way:

awk '{printf "%s %s %10s %s/%s\n", $1, $2, $3, b, $4}' b="$bucket" out
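
One caveat with a fixed format string: a path referenced as $4 keeps only its first whitespace-delimited word, so names such as "CURRENCY MAP.csv" would be cut short. A minimal sketch that takes the rest of the line instead. The function name is hypothetical, and the width of 10 is an assumption that matches this sample listing only:

```shell
# Reprint each line with a fixed layout, taking everything after the
# third field as the path so that embedded spaces survive.
prefix_fixed_width() {
    # $1: prefix; reads the listing on stdin
    awk -v b="$1" '{
        name = $0
        # Drop the first three fields and the blanks after each of them.
        sub(/^[ \t]*[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+[ \t]+/, "", name)
        printf "%s %s %10s %s/%s\n", $1, $2, $3, b, name
    }'
}

# Usage: prefix_fixed_width "$bucket" < out
```

Unlike the string-editing approaches, this imposes its own column widths rather than preserving whatever spacing the input had.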