How to set the average age in empty cells using awk

The dataset I'm working with looks like this:

$ cat file
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S

I wrote an awk script to replace the empty cells in the Age column with the average of everyone else's ages.

The code is as follows:

$ cat tst.awk
BEGIN{FS=OFS=","}
     NR==FNR && $7
     {sum+=$7;
     elementos++;
     next}
     !$7{$7=media}
     {print}
     ENDFILE{media=sum/elementos}

The result it gives is:

$ awk -f tst.awk file
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q

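As you can see, the code only prints the rows whose age was supposed to be filled in, not all of the rows. On top of that, the first row with the headers has been dropped.

The expected output is: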

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,44.5,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S

Note that the average age of the sample is 44.5, so it shows up in the row: 6,0,3,"Moran, Mr. James",male,*44.5*,0,0,330877,8.4583,,Q

What's wrong here? I need to do this with a loop, using awk.


Original question:

The dataset I'm working with looks like this:

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S

I wrote an awk script to replace the empty cells in the Age column with the average of everyone else's ages.

The code is as follows:

BEGIN{FS=OFS=","}
     NR==FNR && $7
     {sum+=$7;
     elementos++;
     next}
     !$7{$7=media}
     {print > "/tmp/train4.csv" }
     ENDFILE{media=sum/elementos} 

The result it gives is:

6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13,,S
20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.225,,C
27,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.225,,C
29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792,,Q
30,0,3,"Todoroff, Mr. Lalio",male,,0,0,349216,7.8958,,S
32,1,1,"Spencer, Mrs. William Augustus (Marie Eugenie)",female,,1,0,PC 17569,146.5208,B78,C
33,1,3,"Glynn, Miss. Mary Agatha",female,,0,0,335677,7.75,,Q
37,1,3,"Mamee, Mr. Hanna",male,,0,0,2677,7.2292,,C
43,0,3,"Kraeff, Mr. Theodor",male,,0,0,349253,7.8958,,C
46,0,3,"Rogers, Mr. William John",male,,0,0,S.C./A.4. 23567,8.05,,S

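As you can see, the code only prints the rows whose age was supposed to be filled in, not all of the rows. On top of that, the first row with the headers has been dropped.

The expected output is: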

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,31.2,0,0,330877,8.4583,,Q
...

Note that the average age of the 6-row sample is 31.2, so it shows up in row 6: 6,0,3,"Moran, Mr. James",male,*31.2*,0,0,330877,8.4583,,Q

What's wrong here? I need to do this with a loop, using awk.

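Assumptions/Understandings (from OP's comments):

  • All Name data contains a single embedded comma, so with the comma defined as the field separator the Age column is actually field #7
  • The output format for the average Age has one digit to the right of the decimal point
  • The size of the input file is unknown at this point, so to avoid running into a potential memory issue we'll look at an awk solution that makes 2 passes over the input file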
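
To see why Age lands in field #7 rather than #6, here is a minimal sketch that splits one row from the question on bare commas. It assumes nothing beyond a POSIX awk; the variable names row and result are only for illustration:

```shell
# One sample row from the question; the quoted Name contains a comma.
row='5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S'

# With -F, the Name is cut in two, so the row has 13 fields instead of 12
# and every column after Name shifts right by one: $6 is Sex, $7 is Age.
result=$(printf '%s\n' "$row" | awk -F, '{ print NF, $6, $7 }')
echo "$result"
```

This prints 13 male 35, which matches the first assumption above.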

One awk idea:

awk '
BEGIN   { FS=OFS="," }                              # input/output field delimiter is comma

# FNR==NR ==> process 1st input file

FNR==NR { if (FNR > 1)                              # ignore header row
             if ($7+0 == $7) {                      # if field #7 is non-empty and a number then ...
                elementos++                         # keep track of number of non-empty fields
                sum+=$7                             # add to our running sum
             }
          next
        }

# the rest of this script is for processing the 2nd input file

FNR==1  { media = 0                                 # while processing the header go ahead and determine the average
          if (elementos>0) 
             media = sprintf("%.1f", sum/elementos)
          print                                     # print the header row
          next                                      # skip to the next line of input
        }
        { if ($7=="")                               # if field #7 is empty ...
              $7 = media                            # set field #7 to the average
          print                                     # print the current line
        }
' input.csv input.csv > output.csv

Using just the OP's sample 8-line input file, this generates:

$ cat output.csv
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,35.0,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S

$ diff input.csv output.csv
7c7
< 6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
---
> 6,0,3,"Moran, Mr. James",male,35.0,0,0,330877,8.4583,,Q

From the diff output we can see that the blank Age column for PassengerId=6 has been updated with the average of 35.0.

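Assuming the Name column may not always contain exactly one embedded comma, the OP will want to look at solutions that can handle that case. One idea is to look at the GNU awk FPAT feature.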
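
As a small illustration of that risk, the sketch below splits a made-up row whose Name holds two commas. The passenger, the ticket number and the fare are hypothetical and do not come from the OP's data; only POSIX awk is assumed:

```shell
# Hypothetical row: the quoted Name contains two commas instead of one.
row='8,0,3,"Doe, Mr. John, Jr",male,40,0,0,12345,7.5,,S'

# Splitting on bare commas now yields 14 fields, so the hard-coded $7
# picks up Sex and the Age has moved on to $8.
result=$(printf '%s\n' "$row" | awk -F, '{ print NF, $7, $8 }')
echo "$result"
```

This prints 14 male 40, so a script that always reads $7 would silently average the wrong column for such rows.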

Using GNU awk for FPAT (which you must already be using, since your script relies on ENDFILE):

$ cat tst.awk
BEGIN {
    FPAT = "([^,]*)|(\"[^\"]*\")"
    OFS = ","
    ARGV[ARGC++] = ARGV[1]
}
NR == FNR {
    if ( FNR>1 && $6 ) {
        sum += $6
        elementos++
    }
    next
}
FNR == 1 {
    media = ( elementos ? sum / elementos : 0 )
}
!$6 {
    $6 = media
}
{ print }

$ awk -f tst.awk file
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,44.5,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S

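I'm assuming you want elementos to be the count of data rows that have an age populated rather than the total number of data rows, and that an age of 0 should be treated the same as a missing age.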
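
A quick way to check how a bare field test treats those cases is sketched below. It feeds three single-field records (empty, 0 and 35) to a POSIX awk and uses $1 as a stand-in for the Age field; the labels missing and present are just illustrative:

```shell
# Each input line plays the role of one Age value: empty, 0, and 35.
# !$1 is true for both the empty field and the numeric-looking 0.
result=$(printf '%s\n' '' '0' '35' |
    awk '{ print ( !$1 ? "missing" : "present" ) }')
echo "$result"
```

The first two records come out as missing and only 35 as present, which is the behaviour described above.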

The ternary expression elementos ? sum / elementos : 0 is needed so that you don't get a divide-by-zero error if there are no non-zero ages in the input.
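
The guard can be tried in isolation with the sketch below. It uses a hypothetical two-column input in which every Age cell is empty, so elementos is never incremented; only POSIX awk is assumed:

```shell
# Hypothetical input: a header plus two rows, both with an empty Age.
input='Name,Age
a,
b,'

result=$(printf '%s\n' "$input" | awk -F, '
    NR > 1 && $2 { sum += $2; elementos++ }             # never matches here
    END { print ( elementos ? sum / elementos : 0 ) }   # guarded: falls back to 0
')
echo "$result"
```

With the guard in place the script prints 0 instead of attempting sum / elementos with a zero divisor.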

I like ruby for quick CSV stuff:

ruby -rcsv -e '
    data = CSV.read(ARGV.shift)
    col = data[0].index("Age")
    ages = data
            .drop(1)
            .map {|row| row[col]}
            .reject(&:nil?)
            .map(&:to_f)                    # floats, so the mean is not truncated
    media = ages.empty? ? 0 : ages.sum / ages.size
    data.each {|row| 
        row[col] ||= media
        puts CSV.generate_line(row)
    }
' file