awk - calculate percentages if condition is met

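I'm just getting started with awk and don't know how to calculate a percentage when a specific condition is met.

This is the file I'm working with: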

user,gender,age,native_lang,other_lang
0,M,19,finnish,english swedish german 
1,M,30,urdu,english 
2,F,26,finnish,english swedish german
3,M,20,finnish,english french swedish 
4,F,20,finnish,english swedish 
5,F,29,finnish,english 
6,F,23,swedish,finnish english 
7,F,19,swedish,finnish english french 
8,F,25,finnish,english swedish german russian french estonian

I want to calculate a percentage based on a condition:

The script I wrote is the following:

awk -F ',' {$4~/finnish/ && $5~/swedish/}END{for (i in a)}

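Given these lines, the expected output should be 44.44%

I can't find a way to add "+1" to a variable that keeps the running total.

How can this be done?

Thanks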

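Use ++ to increment a variable that counts the matches. At the end, divide by NR-1, which is the number of input lines excluding the header.

The condition for executing a block goes before the {}, not inside it.

The script argument needs to be quoted.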

awk -F ',' '$4~/finnish/ && $5~/swedish/ {count++}
            END {printf("%.2f%%\n", 100*count/(NR-1))}' filename.csv
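As a quick check, the one-liner can be run against the sample rows from the question. The sketch below wraps it in a shell function that reads CSV from standard input; the name `pct_match` is made up for illustration. It assumes the native language is in field 4 and the other languages in field 5, and it adds a guard so that header-only input prints nothing instead of dividing by zero.

```shell
# Hypothetical helper: print the percentage of data rows whose 4th field
# matches "finnish" and whose 5th field matches "swedish". Reads CSV on stdin.
pct_match() {
    awk -F ',' '
        NR > 1 && $4 ~ /finnish/ && $5 ~ /swedish/ { count++ }   # skip header, count matching rows
        END { if (NR > 1) printf("%.2f%%\n", 100 * count / (NR - 1)) }
    '
}

pct_match <<'EOF'
user,gender,age,native_lang,other_lang
0,M,19,finnish,english swedish german
1,M,30,urdu,english
2,F,26,finnish,english swedish german
3,M,20,finnish,english french swedish
4,F,20,finnish,english swedish
5,F,29,finnish,english
6,F,23,swedish,finnish english
7,F,19,swedish,finnish english french
8,F,25,finnish,english swedish german russian french estonian
EOF
```

On these rows it prints 55.56%, because five of the nine data rows (users 0, 2, 3, 4 and 8) have finnish in field 4 and swedish in field 5.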

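Assumptions:

  • the objective is to calculate the percentage of input lines that match a given native/other language pair
  • the input delimiter is a comma
  • the native match is on the 4th input field
  • the other match is on one word in the 5th input field (multiple words are separated by white space)
  • comparisons should be case-insensitive
  • there are five finnish/swedish matches in the sample input, so the result should be 55.56% (as opposed to the 44.44% suggested by the OP)
  • there is no need to worry about languages made up of multiple words (re: EdMorton's comment)
  • there is no 'extra' white space next to the comma delimiters (otherwise we would need to trim leading/trailing white space from the comma-delimited fields)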

One awk idea:

native='finnish'
other='swedish'

awk -v native="${native}" -v other="${other}" -F"," '

BEGIN  { native = tolower(native)                 # convert everything to lower case
         other  = tolower(other)                  # to simulate case-insensitive matching
       }

FNR==1 { next }                                   # skip header; just in case "native" or "other" have a match in this line

tolower($4) == native {                           # case-insensitive match on field #4?

         n=split(tolower($5),a,"[[:space:]]")     # case-insensitive split of field #5 into components; should address EdMorton comment about substring matching multiple languages

         for (i=1;i<=n;i++)                       # loop through array looking for matches
             if (other == a[i]) {                 # and if found ...
                count++                           # increment our counter and ...
                next                              # skip to next input line; do not want to double count if there is a dupe in field #5
             }
       }

END    { if (NR >= 2)                             # as long as we have at least one data line ...
            printf "%.2f%%\n", 100*count/(NR-1)   # print the % of input lines that match the "native/other" pair
       }
' users.dat

This generates:

55.56%
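The reason for splitting field 5 into words, rather than using a regex match on it, can be seen with a made-up row. The sketch below compares the two approaches on a single hypothetical record whose other-language field contains `swedishsign` (an invented token that is not in the sample data). The function name `match_modes` is also just for illustration.

```shell
# Hypothetical helper: for each CSV row on stdin, report whether field 5
# contains "swedish" as a substring and whether it contains it as a whole word.
match_modes() {
    awk -F ',' '{
        substr_hit = ($5 ~ /swedish/) ? "yes" : "no"      # regex match: hits anywhere inside the field
        word_hit = "no"
        n = split($5, a, " ")                              # " " splits on runs of white space
        for (i = 1; i <= n; i++)
            if (a[i] == "swedish") word_hit = "yes"       # exact comparison against each word
        print "substring=" substr_hit, "word=" word_hit
    }'
}

echo '9,F,31,finnish,english swedishsign' | match_modes
```

This prints `substring=yes word=no`: the regex would count the row, while the word-by-word comparison would not.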