awk

Question

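I'm just getting started with awk and don't know how to calculate a percentage when specific conditions are met.

This is the file I'm working with: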

user,gender,age,native_lang,other_lang
0,M,19,finnish,english swedish german 
1,M,30,urdu,english 
2,F,26,finnish,english swedish german
3,M,20,finnish,english french swedish 
4,F,20,finnish,english swedish 
5,F,29,finnish,english 
6,F,23,swedish,finnish english 
7,F,19,swedish,finnish english french 
8,F,25,finnish,english swedish german russian french estonian

I want to calculate a percentage based on these conditions:

native_lang = 'finnish'
other_lang = 'swedish'

This is the script I have written so far:

awk -F ',' {$4~/finnish/ && $5~/swedish/}END{for (i in a)}

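Given these lines, the expected output should be 44.44%.

I can't find a way to add "+1" to a variable that keeps the running total.

How can this be done?

Thanks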

Answer 1

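Increment a variable with ++ to count the matches. At the end, divide by NR-1, which is the number of input lines excluding the header.

The condition for running a block goes before the {}, not inside it.

The script argument needs to be quoted.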

awk -F ',' '$4~/finnish/ && $5~/swedish/ {count++} 
            END {printf("%.2f%%\n", 100*count/(NR-1))}' filename.csv
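As a quick check, the same command can be run against the sample rows from the question. This is only a sketch: the `sample` and `result` variable names are assumptions made for illustration, and the data is piped in on standard input so that no file is needed.

```shell
# Sample data from the question, held in a shell variable (assumed name: sample).
sample='user,gender,age,native_lang,other_lang
0,M,19,finnish,english swedish german
1,M,30,urdu,english
2,F,26,finnish,english swedish german
3,M,20,finnish,english french swedish
4,F,20,finnish,english swedish
5,F,29,finnish,english
6,F,23,swedish,finnish english
7,F,19,swedish,finnish english french
8,F,25,finnish,english swedish german russian french estonian'

# Count rows whose 4th field matches finnish and whose 5th field matches swedish,
# then divide by the number of data rows (NR-1 leaves out the header line).
result=$(printf '%s\n' "$sample" |
    awk -F ',' '$4~/finnish/ && $5~/swedish/ {count++}
                END {printf("%.2f%%\n", 100*count/(NR-1))}')

echo "$result"
```

On this sample the command prints 55.56%, because five of the nine data rows (users 0, 2, 3, 4 and 8) satisfy both conditions.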

Answer 2

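Assumptions:

the objective is to count the lines that match a given native/other language pair
the input delimiter is a comma
the native match is on the 4th input field
the other match is on one word in the 5th input field (multiple words are separated by white space)
the comparison should be case-insensitive
there are five finnish/swedish matches in the sample input, so the result should be 55.56% (as opposed to the 44.44% suggested by the OP)
there is no need to worry about languages made up of multiple words (re: EdMorton's comment)
there is no 'extra' white space next to the comma delimiters (otherwise we would need to trim leading/trailing white space from the comma-delimited fields)

One awk idea: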

native='finnish'
other='swedish'

awk -v native="${native}" -v other="${other}" -F"," '

BEGIN  { native = tolower(native)                 # convert everything to lower case
         other  = tolower(other)                  # to simulate case-insensitive matching
       }

FNR==1 { next }                                   # skip header; just in case "native" or "other" have a match in this line

tolower($4) == native {                         # case-insensitive match on field #4?

         n=split(tolower($5),a,"[[:space:]]")   # case-insensitive split of field #5 into components; should address EdMorton comment about substring matching multiple languages

         for (i=1;i<=n;i++)                       # loop through array looking for matches
             if (other == a[i]) {                 # and if found ...
                count++                           # increment our counter and ...
                next                              # skip to next input line; do not want to double count if there is a dupe in field #5
             }
       }

END    { if (NR >= 2)                             # as long as we have at least one data line ...
            printf "%.2f%%\n", 100*count/(NR-1)   # print the % of input lines that match the "native/other" pair
       }
' users.dat

This generates:

55.56%
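If the last assumption does not hold and there may be stray white space around the commas, one possible variant trims field 4 before comparing it. This is a hedged sketch rather than part of the answer above: the `pct_pair` function name and the `f4` and `total` variables are hypothetical, the input is read from standard input, and `[ \t]` is used in place of `[[:space:]]` only so that the block also runs on older awk builds without POSIX character classes.

```shell
# pct_pair NATIVE OTHER   (hypothetical helper; reads the CSV on standard input)
pct_pair() {
    awk -v native="$1" -v other="$2" -F ',' '
    BEGIN  { native = tolower(native); other = tolower(other) }
    FNR==1 { next }                               # skip the header line
           { total++                              # count every data line
             f4 = tolower($4)
             gsub(/^[ \t]+|[ \t]+$/, "", f4)      # trim blanks around field 4
             if (f4 != native) next
             n = split(tolower($5), a, "[ \t]+")  # split field 5 into words
             for (i = 1; i <= n; i++)
                 if (a[i] == other) { count++; next }   # count each line once
           }
    END    { if (total) printf "%.2f%%\n", 100*count/total }
    '
}
```

It would be called as `pct_pair finnish swedish < users.dat`. Counting data lines in `total` instead of using NR-1 is a design choice here, so that the END block prints nothing for input that contains only a header.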

awk - calculate percentages if condition is met