Comparing two CSV files in Linux

Question

I have two CSV files in the following format:

File 1:

No.1, No.2
983264,72342349
763498,81243970
736493,83740940

File 2:

No.1,No.2
"7938493","7364987"
"2153187","7387910"
"736493","83740940"

I need to compare the two files and print the values that match and the values that do not. I tried to do it with awk:

#!/bin/bash

awk 'BEGIN {
    FS = OFS = ","
}
if (FNR==1){next}
NR>1 && NR==FNR {
    a[$1];
    next
}
FNR>1 {
    print ($1 in a) ? $1 FS "Match" : $1 FS "In file2 but not in file1"
    delete a[$1]
}
END {
    for (x in a) {
        print x FS "In file1 but not in file2"
    }
}'file1 file2

But the output is:

"7938493",In file2 but not in file1
"2153187",In file2 but not in file1
"8172470",In file2 but not in file1
7938493,In file1 but not in file2
2153187,In file1 but not in file2
8172470,In file1 but not in file2

Can you tell me what I did wrong?

Answer 1

Here are some corrections to your script:

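BEGIN {
    # FS = OFS = ","
    # Treat runs of commas and double quotes as the field separator
    FS = "[,\"]+"
    OFS = ", "
}
# if (FNR==1){next}
# Skip the header line of each file
FNR == 1 {next}

# NR>1 && NR==FNR {
# First file: remember every value from the first column
NR==FNR {
    a[$1];
    next
}
# FNR>1 {
# Second file: $1 is empty because each line starts with a quote, so use $2
$2 in a {
    # print ($1 in a) ? $1 FS "Match" : $1 FS "In file2 but not in file1"
    print ($2 in a) ? $2 OFS "Match" : $2 OFS "In file2 but not in file1"
    delete a[$2]
}
# Whatever is left in a was never seen in the second file
END {
    for (x in a) {
        print x, "In file1 but not in file2"
    }
}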

This is an awk script, so you can run it with awk -f script.awk file1 file2. Doing so gives these results:

$ awk -f script.awk file1 file2
736493, Match
763498, In file1 but not in file2
983264, In file1 but not in file2
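
The run above prints no "In file2 but not in file1" lines, because the pattern on the second-file block only lets through lines whose key is already in a. If those lines are wanted as well, one possible variant is sketched below. It is not part of the original answer: it strips the quotes with gsub instead of changing FS, the function name compare_csv and the temporary directory are made up for illustration, and the sample data is copied from the question.

```shell
#!/bin/bash
# Recreate the sample files from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'No.1, No.2' '983264,72342349' '763498,81243970' \
    '736493,83740940' > "$tmp/file1"
printf '%s\n' 'No.1,No.2' '"7938493","7364987"' '"2153187","7387910"' \
    '"736493","83740940"' > "$tmp/file2"

# compare_csv FILE1 FILE2 (hypothetical helper name)
# Compares the first column of both files and labels every value.
compare_csv() {
    awk -F, '
    FNR == 1  { next }                  # skip the header of each file
              { gsub(/"/, "", $1) }     # drop quotes around the key
    NR == FNR { a[$1]; next }           # first file: remember the key
    {                                   # second file: label every line
        print $1 ", " (($1 in a) ? "Match" : "In file2 but not in file1")
        delete a[$1]
    }
    END {                               # keys never seen in the second file
        for (x in a) print x ", In file1 but not in file2"
    }' "$1" "$2"
}

compare_csv "$tmp/file1" "$tmp/file2"
```

With the sample data this should label 7938493 and 2153187 as present only in file2, 736493 as a match, and 983264 and 763498 as present only in file1. The order of the last two lines is not guaranteed, since awk does not define the iteration order of for (x in a).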

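The main problem with your script was that it did not correctly handle the double quotes around the numbers in file2. I changed the input field separator so that double quotes are treated as part of the separator. As a result, the first field $1 in the second file is empty (it is the part between the start of the line and the first "), so you need to use $2 to refer to the first value you are interested in. Apart from that, I removed some redundant conditions from your other blocks and used OFS rather than FS in your first print statement.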
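
To see why the key moves from $1 to $2, it can help to print the fields awk produces for one quoted line. The snippet below is only an illustration: the function name split_fields is made up, and the input line is taken from file2 in the question.

```shell
#!/bin/bash
# split_fields (hypothetical helper): print every field awk sees
# when runs of commas and double quotes act as the separator.
split_fields() {
    awk 'BEGIN { FS = "[,\"]+" }
         { for (i = 1; i <= NF; i++) printf "field %d = [%s]\n", i, $i }'
}

printf '%s\n' '"736493","83740940"' | split_fields
# field 1 = []
# field 2 = [736493]
# field 3 = [83740940]
# field 4 = []
```

The leading quote counts as a separator, so an empty field comes before it and the first number lands in $2. An unquoted line from file1 has no leading separator, which is why $1 still works there.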
