awk - calculate percentages if condition is met
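I am just starting out with awk and don't know how to calculate a percentage when a specific condition is met.
This is the file I am working with: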
user,gender,age,native_lang,other_lang
0,M,19,finnish,english swedish german
1,M,30,urdu,english
2,F,26,finnish,english swedish german
3,M,20,finnish,english french swedish
4,F,20,finnish,english swedish
5,F,29,finnish,english
6,F,23,swedish,finnish english
7,F,19,swedish,finnish english french
8,F,25,finnish,english swedish german russian french estonian
I want to calculate the percentage based on these conditions:
native_lang = 'finnish'
other_lang = 'swedish'
The script I wrote is as follows:
awk -F ',' {$4~/finnish/ && $5~/swedish/}END{for (i in a)}
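Given these lines, the expected output should be 44.44%.
I can't find a way to add "+1" to a variable that keeps the running total.
How can this be done?
Thanks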
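Use ++ to increment a variable to get the number of matches. At the end, divide by NR-1, i.e. the number of input lines (excluding the header).
The condition for executing a block goes before the {}, not inside it.
The script argument needs to be quoted.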
awk -F ',' '$4~/finnish/ && $5~/swedish/ {count++}
END {printf("%.2f%%\n", 100*count/(NR-1))}' filename.csv
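To see which rows the one-liner actually counts, here is a minimal sketch that runs the same pattern against the sample rows from the question and prints each matching user before the percentage. The temporary file, the explicit header skip, the `match: user` label and the `out` variable are assumptions added for illustration, not part of the answer above.

```shell
# Sample rows copied from the question into a temporary file
# (the temp file stands in for filename.csv; this is an assumption).
data=$(mktemp)
cat > "$data" <<'EOF'
user,gender,age,native_lang,other_lang
0,M,19,finnish,english swedish german
1,M,30,urdu,english
2,F,26,finnish,english swedish german
3,M,20,finnish,english french swedish
4,F,20,finnish,english swedish
5,F,29,finnish,english
6,F,23,swedish,finnish english
7,F,19,swedish,finnish english french
8,F,25,finnish,english swedish german russian french estonian
EOF

out=$(awk -F',' '
NR == 1 { next }                       # skip the header row
$4 ~ /finnish/ && $5 ~ /swedish/ {     # the pattern sits before the braces
    count++                            # one more matching row
    print "match: user " $1            # show which row was counted
}
END { if (NR > 1) printf "%.2f%%\n", 100 * count / (NR - 1) }
' "$data")

echo "$out"
rm -f "$data"
```

With the sample data this lists users 0, 2, 3, 4 and 8, so the last line is 55.56% (5 of 9 data rows) rather than the 44.44% expected in the question.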
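Assumptions:
- the objective is to calculate the percentage of input lines that match a given native/other language pair
- the input delimiter is a comma
- native is matched against the 4th input field
- other is matched against one word in the 5th input field (multiple words are separated by white space)
- comparisons should be case-insensitive
- there are five finnish/swedish matches in the sample input, so the result should be 55.56% (as opposed to the 44.44% suggested by the OP)
- no need to worry about languages made up of multiple words (re: EdMorton's comment)
- there is no 'extra' white space next to the comma delimiters (otherwise we would need to trim leading/trailing white space from the comma-delimited fields)
One awk idea: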
native='finnish'
other='swedish'

awk -v native="${native}" -v other="${other}" -F"," '
BEGIN   { native = tolower(native)                    # convert everything to lower case
          other  = tolower(other)                     # to simulate case-insensitive matching
        }
FNR==1  { next }                                      # skip header; just in case "native" or "other" have a match in this line
tolower($4) == native {                               # case-insensitive match on field #4?
          n=split(tolower($5),a,"[[:space:]]")        # case-insensitive split of field #5 into components; should address EdMorton comment about substring matching multiple languages
          for (i=1;i<=n;i++)                          # loop through array looking for matches
              if (other == a[i]) {                    # and if found ...
                  count++                             # increment our counter and ...
                  next                                # skip to next input line; do not want to double count if there is a dupe in field #5
              }
        }
END     { if (NR >= 2)                                # as long as we have at least one data line ...
              printf "%.2f%%\n", 100*count/(NR-1)     # print the % of input lines that match the "native/other" pair
        }
' users.dat
This generates:
55.56%
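If the last assumption does not hold and fields may carry stray white space around the commas, one possible variation is to trim each field before comparing. The sketch below is only an illustration: the `trim` helper, the three made-up data rows and the `out` variable are hypothetical, and `[ \t]` is used instead of `[[:space:]]` purely for portability to older awk implementations.

```shell
# Three made-up data rows (assumption): padded fields, mixed case, and a
# hyphenated word that must NOT count as a match for "swedish".
out=$(printf '%s\n' \
  'user,gender,age,native_lang,other_lang' \
  '0,M,19, Finnish ,english  Swedish german' \
  '1,M,30,urdu,english' \
  '2,F,26,finnish, swedish-speaking english' |
awk -v native='finnish' -v other='swedish' -F',' '
# hypothetical helper: strip leading/trailing blanks and tabs from a string
function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }

BEGIN  { native = tolower(native); other = tolower(other) }
FNR==1 { next }                                   # skip the header row
trim(tolower($4)) == native {                     # exact, case-insensitive match on field 4
    n = split(trim(tolower($5)), a, /[ \t]+/)     # split field 5 into whole words
    for (i = 1; i <= n; i++)
        if (a[i] == other) { count++; next }      # count each row at most once
}
END    { if (NR >= 2) printf "%.2f%%\n", 100 * count / (NR - 1) }
')

echo "$out"
```

Only the first data row matches here (the third contains "swedish-speaking", which is not the whole word "swedish"), so the sketch prints 33.33%.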