AWK searching records in one file for entries in another file
I have a results.csv file that contains names in the following layout:
name1, 2(random number)
name5, 3
and a sample.txt structured like this:
record_seperator
name1
foo
bar
record_seperator
name2
bla
bluh
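I want to search sample.txt for each name in results.csv and, if it is found, write the matching record to a file.
I tried to build an array from the first file and search it, but I could not get the syntax right.
It needs to run in a bash script. If anyone has a better idea than awk, that is fine too, but I have no admin rights on the machine it has to run on.
The real csv file contains 10,000 names and sample.txt contains 4.5 million records.
I am a beginner with awk, so explanations are much appreciated.
This is my current attempt, which does not work, and I do not know why: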
#!/bin/bash
awk 'BEGIN{
while (getline < "results.csv")
{
    split($0,name,",");
    nameArr[k]=name[1];
}
{
    RS="record_seperator"
    FS="\n"
    for (key in nameArr)
    {
        print nameArr[key]
        print
        if ($2==nameArr[key])
            NR > 1
            {
                #extract file by Record separator and name from line2
                print RS $0 > $2 ".txt"
            }
    }
}
}' sample.txt
Edit:
My expected output would be two files:
name1.txt
record_seperator
name1
foo
bar
name2.txt
record_seperator
name2
bla
bluh
Here is one. Since no expected output was given, it just writes out the raw records:
$ awk '
NR==FNR {              # process first file
    a[$1]=RS $0        # hash the whole record with first field (name) as key
    next               # process next record in the first file
}                      # after this line second file processing
$1 in a {              # if first field value (name) is found in hash a
    f=$1 ".txt"        # generate filename
    print a[$1] > f    # output the whole record
    close(f)           # preserving fds
}' RS="record_seperator\n" sample RS="\n" FS="," results # file order and related vars
Only one match:
$ cat name1.txt
record_seperator
name1
foo
bar
Tested on gawk and mawk; it behaved oddly on the original awk.
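For an awk without multi-character RS support, one option is to leave RS alone and read sample.txt line by line. The following is only a sketch under assumptions: the separator is always a line of its own, the name is always the line right after it, and results.csv lists name2 instead of name5 so that both sample records match. The variable names (names, out, expect_name) are made up for this example, and the two printf lines just recreate the demo input.

```shell
#!/bin/sh
# Demo input (assumption: name2 instead of name5, so both records match).
printf 'name1, 2\nname2, 3\n' > results.csv
printf 'record_seperator\nname1\nfoo\nbar\nrecord_seperator\nname2\nbla\nbluh\n' > sample.txt

awk -F, '
NR == FNR { names[$1]; next }        # first file: remember every name
$0 == "record_seperator" {           # a separator line starts a new record
    if (out != "") close(out)        # release the previous output file
    out = ""
    expect_name = 1                  # the next line holds the record name
    next
}
expect_name {                        # line right after the separator
    expect_name = 0
    if ($0 in names) {
        out = $0 ".txt"              # one output file per matching name
        print "record_seperator" > out
    }
}
out != "" { print > out }            # copy the record while a file is selected
' results.csv sample.txt

cat name1.txt name2.txt
```

Only default RS behaviour is used here, so it should not depend on gawk or mawk. Note that a name occurring in several records would have its file overwritten, because the file is reopened with `>` after close().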
Something like this (not tested):
$ awk -F, 'NR==FNR {a[$1]; next}                # fill array with names from first file
           $1 in a {print rt, $0 > ($1".txt")}  # print the record from second file
                   {rt = RT}' results.csv RS="define_it_here" sample.txt
Since your record separator comes before the record, you need to delay it by one record.
Use awk's built-in line/record iteration instead of working around it.
(Following @Tiw's lead, I also changed name5 to name2 in your results file to get the expected output.)
$ cat a.awk
# collect the result names into an array
NR == FNR {a[$1]; next}
# skip the first (empty) sample record caused by initial record separator
FNR == 1 { next }
# If found, output sample record into the appropriate file
$1 in a {
    f = ($1 ".txt")
    printf "record_seperator\n%s", $0 > f
}
Run it with gawk, which supports a multi-character RS:
$ gawk -f a.awk FS="," results.csv FS="\n" RS="record_seperator\n" sample.txt
Check the results:
$ cat name1.txt
record_seperator
name1
foo
bar
$ cat name2.txt
record_seperator
name2
bla
bluh
The errors in your code:
#!/bin/bash
awk 'BEGIN{
while (getline < "results.csv")
{
    split($0,name,",");
    nameArr[k]=name[1]; ## <-- k does not exist, so you are overwriting nameArr[""] again and again.
}
{
    RS="record_seperator"
    FS="\n"
    for (key in nameArr) ## <-- only one key "" exists, so it is never going to equal $2
    {
        print nameArr[key]
        print
        if ($2==nameArr[key])
            NR > 1
            {
                #extract file by Record separator and name from line2
                print RS $0 > $2 ".txt"
            }
    }
}
}' sample.txt
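Apart from the two marked lines, the `while (getline < "results.csv")` loop is fragile: getline returns -1 when the file cannot be read, and awk treats -1 as true, so the loop would never end. A small sketch of the return values, where no_such_file.csv is a hypothetical name assumed not to exist in the current directory:

```shell
#!/bin/sh
out=$(awk 'BEGIN {
    file = "no_such_file.csv"            # hypothetical; assumed to be missing
    status = (getline line < file)       # 1 = got a line, 0 = end of file, -1 = error
    print "getline returned " status

    n = 0
    while ((getline line < file) > 0)    # "> 0" ends the loop on EOF and on errors
        n++
    print "lines read: " n
}')
printf '%s\n' "$out"
```

This prints "getline returned -1" followed by "lines read: 0"; without the `> 0` comparison the while loop would spin forever on the missing file.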
And in the sample you showed:
name1, 2(random number)
name5, 3 ## <-- name5 here, not name2 !
After changing name5 to name2, here is your own code, updated:
#!/bin/bash
awk 'BEGIN{
    while ( (getline line < "results.csv") > 0 ) { # Avoid an infinite loop when a read error is encountered.
        split(line,name,",");
        nameArr[name[1]]; # No assignment is needed; referring to the element once creates the key (name[1]).
    }
    RS="record_seperator";
    FS="\n";
}
$2 in nameArr {
    print RS $0; # You can add `> $2 ".txt"` later yourself.
}' sample.txt
Output:
record_seperator
name1
foo
bar
record_seperator
name2
bla
bluh
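To get one file per name from the updated script, the print can be redirected to a file name built from the second field. A minimal sketch, assuming an awk that accepts a multi-character RS (gawk and mawk do) and a results.csv that lists name2 rather than name5; the variable f and the close() calls are additions for this example, and the two printf lines only recreate the demo input:

```shell
#!/bin/sh
# Demo input (assumption: name2 instead of name5, so both records match).
printf 'name1, 2\nname2, 3\n' > results.csv
printf 'record_seperator\nname1\nfoo\nbar\nrecord_seperator\nname2\nbla\nbluh\n' > sample.txt

awk 'BEGIN {
    # Load the names; "> 0" stops the loop on end of file and on read errors.
    while ((getline line < "results.csv") > 0) {
        split(line, name, ",")
        nameArr[name[1]]
    }
    close("results.csv")
    RS = "record_seperator"     # multi-character RS: needs gawk or mawk
    FS = "\n"                   # each line of a record becomes one field
}
$2 in nameArr {                 # $1 is empty: every record starts with a newline
    f = $2 ".txt"
    printf "%s%s", RS, $0 > f   # $0 already ends with a newline
    close(f)                    # avoid running out of file descriptors
}' sample.txt

cat name1.txt name2.txt
```

printf is used instead of print so that no extra blank line is appended to each file. With 10,000 names the close(f) matters, since most awks can only keep a limited number of files open; if a name can appear in more than one record, `>>` would be needed to keep the earlier records.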