AWK searching records in one file for entries in another file

Question

I have a results.csv file that contains names in the following layout:

name1, 2(random number)  
name5, 3

and a sample.txt that is structured as follows:

record_seperator
name1
foo
bar
record_seperator
name2
bla
bluh

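I want to search sample.txt for every name in results.csv and, if a name is found, write the matching record to a file. I tried to build an array from the first file and search it, but I can't get the syntax right. It needs to run in a bash script. If anyone has a better idea than awk, that is fine too, but I have no admin rights on the machine it has to run on. The real CSV file contains 10,000 names and sample.txt contains 4.5 million records. I'm a beginner with awk, so explanations are much appreciated. This is my current attempt, which doesn't work, and I don't know why: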

#!/bin/bash
awk 'BEGIN{
while (getline < "results.csv")
{
split($0,name,",");
nameArr[k]=name[1];
}
{
RS="record_seperator"
FS="\n"
for (key in nameArr)
        {
         print nameArr[key]
         print $2
         if ($2==nameArr[key])
                 NR > 1
                 {
                #extract file by Record separator and name from line2
                print RS $0 > $2 ".txt"
                }
        }
}
}' sample.txt

Edit: my expected output would be two files:

name1.txt

record_seperator
name1
foo
bar

name2.txt

record_seperator
name2
bla
bluh

Answer 1

Here's one. ~~Since there was no expected output, it just outputs the raw records~~:

$ awk '
NR==FNR {              # process first file 
    a[$1]=RS $0        # hash the whole record with first field (name) as key 
    next               # process next record in the first file
}                      # after this line second file processing
$1 in a {              # if first field value (name) is found in hash a
    f=$1 ".txt"        # generate filename
    print a[$1] > f    # output the whole record
    close(f)           # preserving fds
}' RS="record_seperator\n" sample RS="\n" FS="," results  # file order and related vars

Only one match:

$ cat name1.txt
record_seperator
name1
foo
bar

Tested on gawk and mawk; it behaves oddly on the original awk.

Answer 2

Something like this (untested):

$ awk -F, 'NR==FNR {a[$1]; next}                  # fill array with names from first file
           $1 in a {print rt, $0 > ($1 ".txt")}    # print the record from second file
                   {rt = RT}' results.csv RS="define_it_here" sample.txt

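Since your record separator comes before each record, you need to delay it by one record.

Use awk's built-in line/record iteration instead of working around it.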
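A portable variant of that idea, as a minimal sketch rather than something tested on the real data: instead of changing RS, read sample.txt line by line and treat each `record_seperator` line as the start of a new record. The variable names (`names`, `out`, `pending`) and the demo data are made up for illustration. The sketch assumes each name occurs at most once in sample.txt, because reopening a file with `>` truncates it.

```shell
#!/bin/bash
# Hypothetical demo data mirroring the layout in the question
# (name5 changed to name2, plus an unlisted name3 record).
# Run this in a scratch directory: it overwrites these two files.
printf 'name1, 2\nname2, 3\n' > results.csv
printf 'record_seperator\nname1\nfoo\nbar\nrecord_seperator\nname2\nbla\nbluh\nrecord_seperator\nname3\nbaz\n' > sample.txt

awk -F, '
NR == FNR { names[$1]; next }      # first file: remember every name
$0 == "record_seperator" {         # second file: a new record starts here
    if (out != "") close(out)      # release the previous output file
    out = ""; pending = 1; next
}
pending {                          # the line right after the separator is the name
    pending = 0
    if ($0 in names) {
        out = $0 ".txt"
        print "record_seperator" > out
    }
}
out != "" { print > out }          # copy lines only while inside a wanted record
' results.csv sample.txt
```

Because it relies only on ordinary one-line records, this should also run on awks without multi-character RS support, and it keeps at most one output file open at a time.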

Answer 3

(Following @Tiw's lead, I also changed name5 to name2 in your results file to get the expected output.)

$ cat a.awk
# collect the result names into an array
NR == FNR {a[$1]; next}

# skip the first (empty) sample record caused by initial record separator
FNR ==  1 { next }

# If found, output sample record into the appropriate file
$1 in a {
    f = ($1 ".txt")
    printf "record_seperator\n%s", $0 > f
}

Run it with gawk, which supports a multi-character RS:

$ gawk -f a.awk FS="," results.csv FS="\n" RS="record_seperator\n" sample.txt

Check the results:

$ cat name1.txt
record_seperator
name1
foo
bar
$ cat name2.txt
record_seperator
name2
bla
bluh

Answer 4

The errors in your code:

#!/bin/bash
awk 'BEGIN{
while (getline < "results.csv")
{
split($0,name,",");
nameArr[k]=name[1];  ## <-- k is never set, so you overwrite nameArr[""] again and again.
}
{
RS="record_seperator"
FS="\n"
for (key in nameArr) ## <-- only one key ("") exists, holding just the last name read, so $2 almost never matches
        {
         print nameArr[key]  
         print $2
         if ($2==nameArr[key])
                 NR > 1
                 {
                #extract file by Record separator and name from line2
                print RS $0 > $2 ".txt"
                }
        }
}
}' sample.txt

Also, in the sample you showed:

name1, 2(random number)  
name5, 3  ## <-- name5 here, not name2 !

With name5 changed to name2, here is your own code, updated:

#!/bin/bash
awk 'BEGIN{
    while ( (getline line < "results.csv") > 0 ) {  # "> 0" avoids an infinite loop if a read error occurs.
        split(line,name,",");
        nameArr[name[1]]; # Referencing the element once is enough to create the key (name[1]).
    }
    RS="record_seperator";
    FS="\n";
}

$2 in nameArr {
        print RS $0;  # You can add `> $2 ".txt"` later yourself.
}' sample.txt

Output:

record_seperator 
name1            
foo              
bar              

record_seperator 
name2            
bla              
bluh
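For completeness, here is a rough sketch of the same script with the per-name file output added. It assumes an awk that accepts a multi-character RS (gawk and mawk do), and the demo data and the variable `f` are only illustrative. `printf` is used instead of `print` so that no extra blank line is appended, and each file is closed right after writing so that 10,000 names do not exhaust the open-file limit.

```shell
#!/bin/bash
# Hypothetical demo data mirroring the question (name5 changed to name2).
# Run this in a scratch directory: it overwrites these two files.
printf 'name1, 2\nname2, 3\n' > results.csv
printf 'record_seperator\nname1\nfoo\nbar\nrecord_seperator\nname2\nbla\nbluh\n' > sample.txt

awk 'BEGIN {
    # Load the names; "> 0" stops the loop on EOF (0) and on read errors (-1).
    while ((getline line < "results.csv") > 0) {
        split(line, name, ",")
        nameArr[name[1]]           # referencing the element creates the key
    }
    close("results.csv")
    RS = "record_seperator"        # multi-character RS: assumes gawk or mawk
    FS = "\n"                      # each line of a record becomes one field
}
$2 in nameArr {                    # $1 is empty: the record starts with a newline
    f = $2 ".txt"
    printf "%s%s", RS, $0 > f      # put the separator back in front of the record
    close(f)                       # release the file descriptor right away
}' sample.txt
```

Like the original, this assumes each name appears at most once in sample.txt; a second record with the same name would overwrite the first file.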
