Add comma-separated values inside a column

Question

Hello, I have a file in this format (TSV):

Name  type    Age     Weight       Height 
Xxx   M    12,34,23  50,30,60,70   4,5,6,5.5 
Yxx   F    21,14,32  40,50,20,40   3,4,5,5.5

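I want to sum all the values in Age, Weight and Height and add a column for each total after the existing ones, and then also add some percentage, e.g. Total_Height/Total_Weight (awk '$0=$0"\t"(NR==1?"Percentage":$8/$7)'). I have a large dataset and cannot use Excel.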

Like this:

Name  type    Age     Weight       Height     Total_Age Total_Weight Total_Height Percentage
Xxx   M    12,34,23  50,30,60,70   4,5,6,5.5   69        210         20.5          0.097            
Yxx   F    21,14,32  40,50,20,40   3,4,5,5.5   67        150         17.5          0.11

Answer 1

I would use GNU AWK's split function for this task, as follows. Consider a simple example where file.txt contains

Name  type    Age     Weight       Height 
Xxx   M    12,34,23  50,30,60,70   4,5,6,5.5 
Yxx   F    21,14,32  40,50,20,40   3,4,5,5.5

then

awk 'BEGIN{OFS="\t"}NR==1{print "Age","Total"}NR>1{totalage=0;split($3,ages,",");for(a in ages){totalage+=ages[a]};print $3,totalage}' file.txt

outputs

Age Total
12,34,23    69
21,14,32    67

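Explanation: first I tell GNU AWK to use a tab character as the output field separator (OFS). Then for the first line I print the headers, and for each following line I: set totalage to 0, split the content of the 3rd column at , into the array ages, iterate over that array to sum its values, and then print the content of the 3rd column and the sum. Note that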

Before splitting the string, split() deletes any previously existing elements in the arrays array and seps.

so the array does not need to be reset the way the totalage variable does.

(tested in gawk 4.2.1)
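
For readers who would rather check the totals outside awk, here is a minimal Python sketch of the same split-and-sum step. The function name sum_csv_field is made up for illustration, and it assumes every field is a non-empty list of numbers separated by commas.

```python
def sum_csv_field(field: str) -> float:
    """Sum the numbers in a comma-separated field such as "12,34,23"."""
    # Like awk's split(), str.split builds a fresh list on every call,
    # so nothing carries over from the previous row.
    return sum(float(part) for part in field.split(","))


print(sum_csv_field("12,34,23"))   # Age values of the Xxx row
print(sum_csv_field("4,5,6,5.5"))  # Height values of the Xxx row
```

For the two sample fields this prints 69.0 and 20.5. An empty field would raise ValueError, so real data may need a guard.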

Answer 2

With your shown samples, please try the following code.

awk '
FNR==1{
  # Header line: append the names of the new columns.
  print $0,"Total_Age Total_Weight Total_Height Percentage"
  next
}
FNR>1{
  totAge=totWeight=totHeight=0
  split($3,tmp,",")
  for(i in tmp){
    totAge+=tmp[i]
  }
  split($4,tmp,",")
  for(i in tmp){
    totWeight+=tmp[i]
  }
  split($5,tmp,",")
  for(i in tmp){
    totHeight+=tmp[i]
  }
  $(NF+1)=totAge
  $(NF+1)=totWeight
  $(NF+1)=totHeight
  # Percentage is Total_Height/Total_Weight, or N/A when the weight total is 0.
  $(NF+1)=$(NF-1)==0?"N/A":$NF/$(NF-1)
}
1' Input_file | column -t

OR, adding a shorter version of the above awk code:

awk '
BEGIN{OFS="\t"}
FNR==1{
  print $0,"Total_Age Total_Weight Total_Height Percentage"
  next
}
FNR>1{
  totAge=totWeight=totHeight=0
  split($3,tmp,",")
  for(i in tmp){
    totAge+=tmp[i]
  }
  split($4,tmp,",")
  for(i in tmp){
    totWeight+=tmp[i]
  }
  split($5,tmp,",")
  for(i in tmp){
    totHeight+=tmp[i]
  }
  $(NF+1)=totAge OFS totWeight OFS totHeight
  # Re-split the record so the three totals become separate fields.
  $0=$0
  $(NF+1)=( $(NF-1)==0 ? "N/A" : $NF/$(NF-1) )
}
1' Input_file | column -t

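Explanation: Put simply, the sums of the 3rd, 4th and 5th columns are appended as new columns at the end of the line. Then, per the OP's request, one more column is added that holds the last column divided by the second-to-last column. column -t is used to make the output look better.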
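
The same row-level logic can be sketched in Python for comparison. This is only an illustration under a few assumptions: append_totals is a hypothetical helper name, the row is tab-separated with the value lists in columns 3 to 5, and the ratio is rounded to three decimals here rather than printed with awk's default number format.

```python
def append_totals(line: str) -> str:
    """Append Total_Age, Total_Weight, Total_Height and Height/Weight to one TSV row."""
    fields = line.rstrip("\n").split("\t")
    # Columns 3, 4 and 5 (indexes 2..4) hold the comma-separated values.
    totals = [sum(float(v) for v in fields[i].split(",")) for i in (2, 3, 4)]
    total_weight, total_height = totals[1], totals[2]
    # Same guard as the awk code: no division when the weight total is 0.
    ratio = "N/A" if total_weight == 0 else f"{total_height / total_weight:.3f}"
    return "\t".join(fields + [f"{t:g}" for t in totals] + [ratio])


print(append_totals("Xxx\tM\t12,34,23\t50,30,60,70\t4,5,6,5.5"))
```

For the Xxx row this appends 69, 210, 20.5 and 0.098 to the original five fields.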

Answer 3

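If you have to do the same operation several times, you can also use a function to sum the array values (assuming the values are comma-separated numbers).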

Reusing part of the answer from @RavinderSingh13, and a massive thank you to @Ed Morton for taking the time to provide great feedback that improved the code:

awk '
# sum, arr and i are extra parameters that act as local variables.
function arraySum(field,      sum,arr,i) {
  split(field,arr,",")
  for (i in arr) sum += arr[i]
  return sum
}
FNR==1{
  print $0, "Total_Age", "Total_Weight", "Total_Height", "Percentage"
  next
}
NR > 1 {
  sumWeight = arraySum($4)
  sumHeight = arraySum($5)
  print $0, arraySum($3), sumWeight, sumHeight, (sumWeight ? sumHeight/sumWeight : 0)
}' file | column -t

Output

Name  type  Age       Weight       Height     Total_Age  Total_Weight  Total_Height  Percentage
Xxx   M     12,34,23  50,30,60,70  4,5,6,5.5  69         210           20.5          0.097619
Yxx   F     21,14,32  40,50,20,40  3,4,5,5.5  67         150           17.5          0.116667

Answer 4

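This works with any awk in any shell on every Unix box. It does not create new fields in each record (which is inefficient, as it causes awk to rebuild the record every time a field is changed), it does not update the input record (which is inefficient, as it causes awk to re-split the record into fields every time the record is changed), and it is designed to handle any number of input value columns in any order: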

$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ printf "%s%s", $0, OFS }
NR==1 {
    for (i=3; i<=NF; i++) {
        printf "Total_%s%s", $i, OFS
        tags[i] = $i
    }
    print "Percentage"
    next
}
{
    delete tot
    for (i=3; i<=NF; i++) {
        tag = tags[i]
        n = split($i,vals,",")
        for (j in vals) {
            tot[tag] += vals[j]
        }
        printf "%s%s", tot[tag], OFS
    }
    printf "%0.3f%s", (tot["Weight"] ? tot["Height"] / tot["Weight"] : 0), ORS
}

$ awk -f tst.awk file
Name    type    Age     Weight  Height  Total_Age       Total_Weight    Total_Height    Percentage
Xxx     M       12,34,23        50,30,60,70     4,5,6,5.5       69      210     20.5    0.098
Yxx     F       21,14,32        40,50,20,40     3,4,5,5.5       67      150     17.5    0.117

$ awk -f tst.awk file | column -t
Name  type  Age       Weight       Height     Total_Age  Total_Weight  Total_Height  Percentage
Xxx   M     12,34,23  50,30,60,70  4,5,6,5.5  69         210           20.5          0.098
Yxx   F     21,14,32  40,50,20,40  3,4,5,5.5  67         150           17.5          0.117

To show the functional benefit of the above approach, suppose you need to add more values, e.g. ShoeSize, and/or rearrange the order of the columns, e.g.:

$ column -t file
Name  type  ShoeSize  Height     Age       Weight
Xxx   M     12,8,10   4,5,6,5.5  12,34,23  50,30,60,70
Yxx   F     9,7,8     3,4,5,5.5  21,14,32  40,50,20,40

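Now run the above script and note that you get a Total_ column added for each original value column, and you still get the same Percentage column for Height/Weight added at the end: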

$ awk -f tst.awk file | column -t
Name  type  ShoeSize  Height     Age       Weight       Total_ShoeSize  Total_Height  Total_Age  Total_Weight  Percentage
Xxx   M     12,8,10   4,5,6,5.5  12,34,23  50,30,60,70  30              20.5          69         210           0.098
Yxx   F     9,7,8     3,4,5,5.5  21,14,32  40,50,20,40  24              17.5          67         150           0.117
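
The header-driven idea (key each total by its column name, so the column order stops mattering) could look roughly like this in Python. It is a sketch, not a port of tst.awk: add_totals is a hypothetical name, and it assumes tab-separated input whose first two columns are Name and type.

```python
def add_totals(lines):
    """Yield TSV lines with a Total_<name> column per value column plus Percentage."""
    rows = (line.rstrip("\n").split("\t") for line in lines)
    header = next(rows, None)
    if header is None:  # empty input: nothing to emit
        return
    names = header[2:]  # every column after Name and type holds value lists
    yield "\t".join(header + [f"Total_{n}" for n in names] + ["Percentage"])
    for row in rows:
        # Key the totals by header name so column order does not matter.
        tot = {n: sum(float(v) for v in cell.split(","))
               for n, cell in zip(names, row[2:])}
        weight = tot.get("Weight", 0)
        pct = tot.get("Height", 0) / weight if weight else 0
        yield "\t".join(row + [f"{tot[n]:g}" for n in names] + [f"{pct:.3f}"])


sample = [
    "Name\ttype\tShoeSize\tHeight\tAge\tWeight",
    "Xxx\tM\t12,8,10\t4,5,6,5.5\t12,34,23\t50,30,60,70",
]
for out in add_totals(sample):
    print(out)
```

Looking Height and Weight up by name is what keeps the Percentage column correct after the columns are reordered.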

Answer 5

Here is a Ruby version, which is a bit easier for multiple data fields, e.g.:

ruby -F"\t" -lane '
if ($.==1) 
    puts "Name\ttype\tAge\tWeight\tHeight\tTotal_Age\tTotal_Weight\tTotal_Height\tPercentage"
    next 
end
fields=$F.clone
$F.each{|f| fields.append(f.split(/,/).map(&:to_f).sum) if f[/^[\d,.]+$/] && f[/,/]}
fields.append((fields[-1]/fields[-2]).round(3))
puts fields.join("\t")' file | column -t

Prints:

Name  type  Age       Weight       Height     Total_Age  Total_Weight  Total_Height  Percentage
Xxx   M     12,34,23  50,30,60,70  4,5,6,5.5  69.0       210.0         20.5          0.098
Yxx   F     21,14,32  40,50,20,40  3,4,5,5.5  67.0       150.0         17.5          0.117

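The advantage here is that the sums of columns of the form n.nn,n.nn,... can be flexibly appended to the end of the line in the order they are found.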
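
The pattern-based detection can also be sketched in Python. The names NUMERIC_LIST and list_totals are invented for this example. As in the Ruby one-liner, a field counts as a value list only if it consists of digits, commas and dots and contains at least one comma.

```python
import re

# Digits, commas and dots only, mirroring the Ruby pattern /^[\d,.]+$/.
NUMERIC_LIST = re.compile(r"[\d,.]+")


def list_totals(fields):
    """Return the sum of every field that looks like a comma-separated number list."""
    return [
        sum(float(v) for v in f.split(","))
        for f in fields
        if NUMERIC_LIST.fullmatch(f) and "," in f
    ]


print(list_totals(["Xxx", "M", "12,34,23", "50,30,60,70", "4,5,6,5.5"]))
```

For the Xxx row this prints [69.0, 210.0, 20.5]. A malformed field such as 1,,2 passes the pattern but fails in float(), so messy input would need extra validation.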


python

awk