Using awk to parse sections of a text file
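My script has 2 problems:
- Passing the correct variables to awk
- Awk does not like the particular command used to specify the start and end values to print between the specified patterns.
Here are the contents of states.txt: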
Alabama
Area: 52,423 sq.mi (135,775 sq.km.), 30th
Land: 50,750 sq.mi. (131,442 sq.km.), 28th
Water: 1,673 sq.mi. (4,333 sq.km.), 23rd
Coastline: 53 mi. (85 km.), 17th
Shoreline: 607 mi. (977 km.), 19th
Alaska
Area: 656,425 sq.mi (1,700,134 sq.km.), 1st
Land: 570,374 sq.mi. (1,477,263 sq.km.), 1st
Water: 86,051 sq.mi. (222,871 sq.km.), 1st
Coastline: 6,640 mi. (10,686 km.), 1st
Shoreline: 33,904 mi. (54,563 km.), 1st
Arizona
Area: 114,006 sq.mi (295,274 sq.km.), 6th
Land: 113,642 sq.mi. (294,332 sq.km.), 6th
Water: 364 sq.mi. (943 sq.km.), 48th
Arkansas
Area: 53,182 sq.mi (137,741 sq.km.), 29th
Land: 52,075 sq.mi. (134,874 sq.km.), 27th
Water: 1,107 sq.mi. (2,867 sq.km.), 31st
California
Area: 163,707 sq.mi (423,999 sq.km.), 3rd
Land: 155,973 sq.mi. (403,969 sq.km.), 3rd
Water: 7,734 sq.mi. (20,031 sq.km.), 6th
Coastline: 840 mi. (1,352 km.), 3rd
Shoreline: 3,427 mi. (5,515 km.), 5th
Colorado
Area: 104,100 sq.mi (269,618 sq.km.), 8th
Land: 103,730 sq.mi. (268,660 sq.km.), 8th
Water: 371 sq.mi. (961 sq.km.), 46th
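And so on.
What I want to do is develop a script that, while parsing, extracts the information for each state separately.
So the script looks like this: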
for state in $(cat states.txt | egrep -v 'Area|Land|Water' | grep [A-Z]) ; do
echo $state >> ./statelist.txt ;
done ;
for statesnip in $(cat ./statelist.txt | awk 'NR>1{print p "_" $0 ORS} {p=$0}' | grep [A-Z]) ; do
state1=$(echo $statesnip | awk -F _ '{print $1}') ;
state2=$(echo $statesnip | awk -F _ '{print $2}') ;
cat ./states.txt | awk '/$state1/{f=1}; /$state2/{f=0}' >> $state1.tmp.txt ;
done;
rm -f ./statelist.txt
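This is where the problems are:
The first is the variables being passed to awk:
Such as
awk -v state1=$state1 -v state2=$state2 '/state1/{f=1} f; /state2/{f=0}';
or
awk -v state1=${state1} state2=${state2} '/state1/{f=1} f; /state2/{f=0}';
I get an error.
The second is that when I adjust the variables to their -v format, awk does not like it (it just cats the entire file, countless times).
awk -v state1=${state1} -v state2=${state2} 'state1{f=1} f; state2{f=0}'
I just get a complete cat of the entire file, over and over.
The expected output should look like this: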
cat ./statelist.txt
Alabama
Alaska
Arizona
Arkansas
California
Colorado
cat ./statelist.txt | awk 'NR>1{print p "_" $0 ORS} {p=$0}' | grep [A-Z]
Alabama_Alaska
Alaska_Arizona
Arizona_Arkansas
Arkansas_California
California_Colorado
cat ./Alabama.txt:
Alabama
Area: 52,423 sq.mi (135,775 sq.km.), 30th
Land: 50,750 sq.mi. (131,442 sq.km.), 28th
Water: 1,673 sq.mi. (4,333 sq.km.), 23rd
Coastline: 53 mi. (85 km.), 17th
Shoreline: 607 mi. (977 km.), 19th
cat ./Alaska.txt
Alaska
Area: 656,425 sq.mi (1,700,134 sq.km.), 1st
Land: 570,374 sq.mi. (1,477,263 sq.km.), 1st
Water: 86,051 sq.mi. (222,871 sq.km.), 1st
Coastline: 6,640 mi. (10,686 km.), 1st
Shoreline: 33,904 mi. (54,563 km.), 1st
cat ./Arizona.txt
Arizona
Area: 114,006 sq.mi (295,274 sq.km.), 6th
Land: 113,642 sq.mi. (294,332 sq.km.), 6th
Water: 364 sq.mi. (943 sq.km.), 48th
cat ./Arkansas.txt
Arkansas
Area: 53,182 sq.mi (137,741 sq.km.), 29th
Land: 52,075 sq.mi. (134,874 sq.km.), 27th
Water: 1,107 sq.mi. (2,867 sq.km.), 31st
cat ./California.txt
California
Area: 163,707 sq.mi (423,999 sq.km.), 3rd
Land: 155,973 sq.mi. (403,969 sq.km.), 3rd
Water: 7,734 sq.mi. (20,031 sq.km.), 6th
Coastline: 840 mi. (1,352 km.), 3rd
Shoreline: 3,427 mi. (5,515 km.), 5th
cat ./Colorado.txt
Colorado
Area: 104,100 sq.mi (269,618 sq.km.), 8th
Land: 103,730 sq.mi. (268,660 sq.km.), 8th
Water: 371 sq.mi. (961 sq.km.), 46th
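Any time you write a loop in shell just to manipulate text, you have the wrong approach.
In this case, it looks like all you really need is:
awk 'NF==1{out=$0".txt"} {print > out}' states.txt
If not, please clarify. Oh, and for non-gawk awks you may need to add close(out) before the out=... assignment.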
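For anyone on a non-gawk awk, here is a minimal sketch of what the close(out) hint could look like. The two-state sample file is just a cut-down stand-in for the real states.txt, and the empty-string guard on the first header line is my own addition, not something the answer spells out.

```shell
# Run this in an empty scratch directory: it writes states.txt and one
# <state>.txt file per state into the current directory.
cat > states.txt <<'EOF'
Alabama
Area: 52,423 sq.mi (135,775 sq.km.), 30th
Land: 50,750 sq.mi. (131,442 sq.km.), 28th
Alaska
Area: 656,425 sq.mi (1,700,134 sq.km.), 1st
Land: 570,374 sq.mi. (1,477,263 sq.km.), 1st
EOF

# A one-field line is taken to be a state name: close the previous output
# file (if there is one) and switch to "<state>.txt". Every line, the
# header included, is then written to the current output file.
awk 'NF==1 { if (out != "") close(out); out = $0 ".txt" } { print > out }' states.txt

cat Alaska.txt
```

Closing each file before moving on keeps only one output file open at a time, which is what matters on awks with a low open-file limit. One thing to watch: with the default field separator a two-word name such as "New York" has two fields, so NF==1 would not recognize it as a header.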
Although the question implies that awk is being used to parse the file, the script given uses other commands more than it uses awk. awk can do the whole job.
awk \
' \
BEGIN \
{ FS = ":" }
NF == 1 && /^[A-Z]/ \
{ FILE = $0 ".txt"; printf "\n%s\n\n", $0 >FILE }
NF > 1 \
{ print >FILE }
' states.txt
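While a smaller script would also do the job, this one has a few extra features. Using a colon as the field separator quickly distinguishes data lines from header lines. Blank lines are ignored, and printf() is used to generate the header line in the output file. This means blank lines are not required in the input file, and it also means extra whitespace or blank lines will not mess up the output. This may or may not be what you want.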
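Neither script above needs shell variables inside awk, but the two problems in the question seem to come down to how awk reads its patterns. Inside /.../ the text is a literal regular expression, so /state1/ looks for the string "state1" rather than the variable's value, and a bare state1 as a pattern is true on every line as long as the variable is non-empty, which would explain the whole file being printed. Here is a rough sketch of a range extraction that compares each line against the -v variables instead. The sample data and the choice of Alaska/Arizona are only for illustration.

```shell
# Run this in an empty scratch directory: it writes states.txt and
# Alaska.tmp.txt into the current directory.
cat > states.txt <<'EOF'
Alabama
Area: 52,423 sq.mi (135,775 sq.km.), 30th
Land: 50,750 sq.mi. (131,442 sq.km.), 28th
Alaska
Area: 656,425 sq.mi (1,700,134 sq.km.), 1st
Arizona
Area: 114,006 sq.mi (295,274 sq.km.), 6th
EOF

state1=Alaska
state2=Arizona

# Quote the shell variables and pass each one with its own -v.
# $0 == var compares the whole line with the value of the variable.
# The end test runs first so the header of the next state is not printed.
awk -v state1="$state1" -v state2="$state2" '
    $0 == state2 { f = 0 }   # header of the next state: stop printing
    $0 == state1 { f = 1 }   # header of the wanted state: start printing
    f                        # print the line while the flag is set
' states.txt > "$state1.tmp.txt"

cat "$state1.tmp.txt"
```

If a regex match is wanted rather than an exact comparison, $0 ~ state1 treats the variable's value as a dynamic regular expression. For splitting the whole file, though, the one-pass scripts above are simpler than looping over pairs of states.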