How can I improve the execution speed of my script?
I have a file with about 15 million records. Here is a sample of the data:
99001597,555555555555,3211,Njro_Kaniani,test,NORTH,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,IN2017,OnePlus,1,N/A,Yes,Yes,Yes,N/A
99001679,555555555555,1756,Bnju_HTT,Salam,,Y,RAD,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,BE2012,OnePlus,1,N/A,Yes,Yes,Yes,N/A
99001680,555555555555,1108,Temoni_Kiara,Salam,,Y,RAD,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,BE2028,OnePlus,1,Google,Yes,Yes,Yes,N/A
99001683,555555555555,1604,Blue_Bay,Nzindo,,Y,COAST,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,BE2011,OnePlus,1,N/A,Yes,Yes,Yes,N/A
99001683,555555555555,1820,Sgerea_Makuka,Salaam,,N,RAD,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,BE2011,OnePlus,1,N/A,Yes,Yes,Yes,N/A
99001683,555555555555,1184,Makka,Salaam,,Y,RAD,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,BE2011,OnePlus,1,N/A,Yes,Yes,Yes,N/A
99001683,555555555555,1381,Leaders_Club,Salam,,Y,RAD,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,BE2011,OnePlus,1,N/A,Yes,Yes,Yes,N/A
99001683,555555555555,1037,Mbez,Salam,,Y,RAD,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,BE2011,OnePlus,1,N/A,Yes,Yes,Yes,N/A
99001683,555555555555,1041,Ngano,Salam,,Y,RAD,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,BE2011,OnePlus,1,N/A,Yes,Yes,Yes,N/A
99001683,555555555555,1313,Kichangani,Salam,,Y,RAD,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,BE2011,OnePlus,1,N/A,Yes,Yes,Yes,N/A
99001684,555555555555,4975,Nyugusu Campp2,Test,test,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,BE2026,OnePlus,1,Google,Yes,Yes,Yes,N/A
99001684,555555555555,1041,Ngano,Salam,,Y,RAD,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,BE2026,OnePlus,1,Google,Yes,Yes,Yes,N/A
99001684,555555555555,1420,Airport_Macro,Salaam,,Y,RAD,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,BE2026,OnePlus,1,Google,Yes,Yes,Yes,N/A
99001684,555555555555,3147,Technical_Nzoti,test,ORTH,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,BE2026,OnePlus,1,Google,Yes,Yes,Yes,N/A
99001684,555555555555,4488,Lumala,Mwnza,,Y,Nyeka,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,BE2026,OnePlus,1,Google,Yes,Yes,Yes,N/A
99001684,555555555555,4975,Nyarugusu Campp2,Kigoma,,Y,Nyeka,Yes,Yes,Yes,smart,Yes,Yes,Yes,Yes,Yes,Yes,BE2026,OnePlus,1,Google
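I am using the script below to count the lines that match certain conditions. The problem is that the script is very slow: I get only about 200 lines of output per day.
Currently my program reads the 15-million-record file 36,000 times (about 3,000 IDs, with 12 passes over the file for each ID). This is very inefficient (slow!!). How can I modify my script so that it reads the very large file only once?
Desired output: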
1037,0,0,1,1,1,1,1,1,1,1,1,1
1041,0,0,2,2,2,2,2,2,2,2,2,2
1108,0,0,1,1,1,1,1,1,1,1,1,1
1184,0,0,1,1,1,1,1,1,1,1,1,1
1313,0,0,1,1,1,1,1,1,1,1,1,1
1381,0,0,1,1,1,1,1,1,1,1,1,1
1420,0,0,1,1,1,1,1,1,1,1,1,1
1604,0,0,1,1,1,1,1,1,1,1,1,1
1756,0,0,1,1,1,1,1,1,1,1,1,1
1820,0,0,1,1,1,1,1,1,1,1,1,1
3147,0,0,1,1,1,1,1,0,0,0,0,1
3211,0,0,1,1,1,1,1,0,0,0,0,1
4488,0,0,1,1,1,1,1,1,1,1,1,1
4975,0,0,2,2,2,2,2,1,1,0,0,1
IDs_file contains about 3,000 records, each a single 4-digit number.
while read i
do
twog=$(cat combined_marketing_sadm_report.csv|awk -v src=$i -F, '{if ((( == "Yes")||( == "No")) && ($3 == src) && ( == "No")&& ( == "No")) print $0;}'|wc -l)
threeg=$(cat combined_marketing_sadm_report.csv|awk -v src=$i -F, '{if (($3 == src) &&( == "Yes")&& ( == "No")) print $0;}'|wc -l)
fourg=$(cat combined_marketing_sadm_report.csv|awk -v src=$i -F, '{if (( == "Yes") && ($3 == src)) print $0;}'|wc -l)
lte2100=$(cat combined_marketing_sadm_report.csv|awk -v src=$i -F, '{if (( == "Yes") && ($3 == src)) print $0;}'|wc -l)
lte800=$(cat combined_marketing_sadm_report.csv|awk -v src=$i -F, '{if (( == "Yes") && ($3 == src)) print $0;}'|wc -l)
lte700=$(cat combined_marketing_sadm_report.csv|awk -v src=$i -F, '{if (( == "Yes") && ($3 == src)) print $0;}'|wc -l)
lte1800=$(cat combined_marketing_sadm_report.csv|awk -v src=$i -F, '{if (( == "Yes") && ($3 == src)) print $0;}'|wc -l)
lte2600=$(cat combined_marketing_sadm_report.csv|awk -v src=$i -F, '{if (( == "Yes") && ($3 == src)) print $0;}'|wc -l)
lte900=$(cat combined_marketing_sadm_report.csv|awk -v src=$i -F, '{if (( == "Yes") && ($3 == src)) print $0;}'|wc -l)
threeg2100=$(cat combined_marketing_sadm_report.csv|awk -v src=$i -F, '{if (( == "Yes") && ($3 == src)) print $0;}'|wc -l)
threeg900=$(cat combined_marketing_sadm_report.csv|awk -v src=$i -F, '{if (( == "Yes") && ($3 == src)) print $0;}'|wc -l)
volte=$(cat combined_marketing_sadm_report.csv|awk -v src=$i -F, '{if (( == "Yes") && ($3 == src)) print $0;}'|wc -l)
echo $i,$twog,$threeg,$fourg,$lte2100,$lte800,$lte700,$lte1800,$lte2600,$lte900,$threeg2100,$threeg900,$volte>>Raw_data_for_report.csv
done < IDs_file
Solution: put all the loops inside a single awk program:
#! /usr/bin/awk -f
BEGIN {
    FS=OFS=","
    # Nothing to count if no IDs were passed with -v src=...
    if (src=="") {
        exit
    }
    # Split the comma-separated ID list and zero every counter
    split(src,arr_src,",")
    for (i in arr_src) {
        src=arr_src[i]
        twog[src]=threeg[src]=fourg[src]=lte2100[src]=lte800[src]=lte700[src]=lte1800[src]=lte2600[src]=lte900[src]=threeg2100[src]=threeg900[src]=volte[src]=0
    }
}
{
    # Field 3 holds the ID; update the counters of the matching ID
    for (i in arr_src) {
        src=arr_src[i]
        if ($3 == src) {
            if (( == "Yes" || == "No") && == "No" && == "No") twog[src]++
            if ( == "Yes" && == "No") threeg[src]++
            if ( == "Yes") fourg[src]++
            if ( == "Yes") lte2100[src]++
            if ( == "Yes") lte800[src]++
            if ( == "Yes") lte700[src]++
            if ( == "Yes") lte1800[src]++
            if ( == "Yes") lte2600[src]++
            if ( == "Yes") lte900[src]++
            if ( == "Yes") threeg2100[src]++
            if ( == "Yes") threeg900[src]++
            if ( == "Yes") volte[src]++
        }
    }
}
END {
    # One output line per requested ID
    for (i in arr_src) {
        src=arr_src[i]
        print src,twog[src],threeg[src],fourg[src],lte2100[src],lte800[src],lte700[src],lte1800[src],lte2600[src],lte900[src],threeg2100[src],threeg900[src],volte[src]
    }
}
Run it like this:
./counter.awk -v src=1037,1041,4975 combined_marketing_sadm_report.csv
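If the IDs already sit in a file with one ID per line, the comma-separated list for `-v src=` could be built with `paste`. This is only a sketch: the temporary file stands in for IDs_file, and the variable names `ids_file` and `src_list` are made up for illustration.

```shell
# Stand-in for IDs_file (one ID per line).
ids_file=$(mktemp)
printf '%s\n' 1037 1041 4975 > "$ids_file"

# paste -s joins all lines of the file into one; -d , separates them with commas.
src_list=$(paste -s -d , "$ids_file")
printf '%s\n' "$src_list"

# The list could then be handed to the script, for example:
#   ./counter.awk -v src="$src_list" combined_marketing_sadm_report.csv
rm -f "$ids_file"
```

With a few thousand IDs the resulting argument is long but usually still acceptable to the shell; reading the ID file directly from awk avoids the question altogether.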
Update
If your src values are in a file, the script (counter-v2.awk) becomes:
#! /usr/bin/awk -f
BEGIN {
    FS=OFS=","
}
# First file (the ID list): store every line as an ID
FNR == NR {
    i++
    arr_src[i] = $0
    next
}
# First line of the second file: zero every counter
FNR == 1 {
    for (i in arr_src) {
        src=arr_src[i]
        twog[src]=threeg[src]=fourg[src]=lte2100[src]=lte800[src]=lte700[src]=lte1800[src]=lte2600[src]=lte900[src]=threeg2100[src]=threeg900[src]=volte[src]=0
    }
}
{
    # Field 3 holds the ID; update the counters of the matching ID
    for (i in arr_src) {
        src=arr_src[i]
        if ($3 == src) {
            if (( == "Yes" || == "No") && == "No" && == "No") twog[src]++
            if ( == "Yes" && == "No") threeg[src]++
            if ( == "Yes") fourg[src]++
            if ( == "Yes") lte2100[src]++
            if ( == "Yes") lte800[src]++
            if ( == "Yes") lte700[src]++
            if ( == "Yes") lte1800[src]++
            if ( == "Yes") lte2600[src]++
            if ( == "Yes") lte900[src]++
            if ( == "Yes") threeg2100[src]++
            if ( == "Yes") threeg900[src]++
            if ( == "Yes") volte[src]++
        }
    }
}
END {
    # One output line per ID read from the first file
    for (i in arr_src) {
        src=arr_src[i]
        print src,twog[src],threeg[src],fourg[src],lte2100[src],lte800[src],lte700[src],lte1800[src],lte2600[src],lte900[src],threeg2100[src],threeg900[src],volte[src]
    }
}
The file IDSs_file:
1037
1041
1108
1184
1313
1381
1420
1604
1756
1820
3147
3211
4488
4975
Run it like this (warning: the order of the files matters):
./counter-v2.awk IDSs_file combined_marketing_sadm_report.csv
Output:
1037,0,0,1,1,1,1,1,1,1,1,1,1
1041,0,0,2,2,2,2,2,2,2,2,2,2
1108,0,0,1,1,1,1,1,1,1,1,1,1
1184,0,0,1,1,1,1,1,1,1,1,1,1
1313,0,0,1,1,1,1,1,1,1,1,1,1
1381,0,0,1,1,1,1,1,1,1,1,1,1
1420,0,0,1,1,1,1,1,1,1,1,1,1
1604,0,0,1,1,1,1,1,1,1,1,1,1
1756,0,0,1,1,1,1,1,1,1,1,1,1
1820,0,0,1,1,1,1,1,1,1,1,1,1
3147,0,0,1,1,1,1,1,0,0,0,0,1
3211,0,0,1,1,1,1,1,0,0,0,0,1
4488,0,0,1,1,1,1,1,1,1,1,1,1
4975,0,0,2,2,2,2,2,1,1,0,0,1
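A possible refinement, offered as a sketch rather than as part of the answer: the scripts above loop over every ID for each record, so with about 3,000 IDs that is roughly 3,000 comparisons per line of the big file. Storing the IDs as array keys lets awk test membership with a single `in` lookup. The column numbers 9 and 11, the array names `ids`, `order`, `col9` and `col11`, and the sample rows are hypothetical placeholders, not the columns or counters of the original script.

```shell
#!/bin/sh
# Sketch only: columns 9 and 11 are placeholders, not the original conditions.
tmp=$(mktemp -d)

# Tiny stand-ins for the ID file and the big CSV.
printf '%s\n' 1037 1041 > "$tmp/ids"
cat > "$tmp/data.csv" <<'EOF'
a,b,1037,x,y,z,Y,RAD,Yes,smart,Yes
a,b,1041,x,y,z,Y,RAD,Yes,smart,No
a,b,1041,x,y,z,Y,RAD,Yes,smart,Yes
a,b,9999,x,y,z,Y,RAD,Yes,smart,Yes
EOF

result=$(awk '
BEGIN { FS = OFS = "," }
# First file: remember each ID as an array key and keep the input order.
FNR == NR { ids[$1] = 1; order[++n] = $1; next }
# Second file: one hash lookup per record instead of a loop over all IDs.
$3 in ids {
    if ($9  == "Yes") col9[$3]++
    if ($11 == "Yes") col11[$3]++
}
END {
    # Adding 0 prints 0 for counters that were never incremented.
    for (k = 1; k <= n; k++) {
        id = order[k]
        print id, col9[id] + 0, col11[id] + 0
    }
}' "$tmp/ids" "$tmp/data.csv")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Keeping a separate `order` array also makes the output follow the order of the ID file, which `for (i in arr)` does not guarantee.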