Merge two csv files if Id columns match

I have the following:

file1.csv

"Id","clientName1","clientName2"

file2.csv

"Id","Name1","Name2"

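I want to read file1 sequentially. For each record, I want to check whether file2 has a matching Id. There may be more than one match. For each match, I want to append Name1, Name2 to the end of the record from file1.csv.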

So if a record has more than one match in file2, a possible result is:

"Id","clientName1","clientName2","Name1","Name2","Name1","Name2"

I'm afraid bash may not be an efficient solution, but the following bash script will work:

#!/bin/bash

# Associative arrays require bash 4 or later.
declare -A id_hash

# Append the fields after the Id from the given file to id_hash, keyed by Id.
load() {
    local line id name
    while IFS= read -r line; do
        id=$(printf '%s\n' "$line" | cut -d ',' -f 1)
        name=$(printf '%s\n' "$line" | cut -d ',' -f 2-)
        if [ -z "${id_hash[$id]}" ]; then
            id_hash[$id]=$name
        else
            id_hash[$id]=${id_hash[$id]},$name
        fi
    done < "$1"
}

load file1.csv
load file2.csv

# Note: the iteration order of an associative array is unspecified.
for id in "${!id_hash[@]}"; do
    echo "$id,${id_hash[$id]}"
done

In response, here is the revised version of the single awk command. It merges records when there are duplicated IDs in file1, in file2, or in both, even if the records have different numbers of fields. The old version worked for the OP's question as currently stated.

awk -F',' '{one=$1;$1="";a[one]=a[one]$0} END{for (i in a) print i""a[i]}' OFS=, file[12]

For the input:

file1

"Id1","clientN1","clientN2"
"Id2","Name3","Name4"
"Id3","client00","client01","client02"
"Id1","client1","client2","client3"

file2

"Id1","Name1","Name2"
"Id1","Name3","Name4"
"Id2","Name0","Name1"
"Id2","Name00","Name11","Name22"

The output merges file1 and file2 on matching IDs:

"Id1","clientN1","clientN2","client1","client2","client3","Name1","Name2","Name3","Name4"
"Id2","Name3","Name4","Name0","Name1","Name00","Name11","Name22"
"Id3","client00","client01","client02"
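One caveat with `for (i in a)` is that awk does not specify the order in which array keys are visited, so the rows may come out in a different order than shown. The following is a minimal sketch, not the answerer's command, that records each ID the first time it is seen and prints in that order. The temporary directory and the variable names (`order`, `rest`, `merged`) are assumptions made for illustration, and like the original it splits naively on commas, so quoted fields containing commas are not handled.

```shell
#!/bin/sh
# Sample data copied from the input above, written to a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' \
    '"Id1","clientN1","clientN2"' \
    '"Id2","Name3","Name4"' \
    '"Id3","client00","client01","client02"' \
    '"Id1","client1","client2","client3"' > "$tmp/file1"
printf '%s\n' \
    '"Id1","Name1","Name2"' \
    '"Id1","Name3","Name4"' \
    '"Id2","Name0","Name1"' \
    '"Id2","Name00","Name11","Name22"' > "$tmp/file2"

merged=$(awk -F',' '
    {
        id = $1
        # Remember each Id the first time it appears.
        if (!(id in rest)) order[++n] = id
        # Append everything after the Id, including the leading comma.
        rest[id] = rest[id] substr($0, length(id) + 1)
    }
    END {
        # Print in first-seen order instead of the unspecified "for (i in a)" order.
        for (k = 1; k <= n; k++) print order[k] rest[order[k]]
    }
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$merged"
rm -rf "$tmp"
```

With the sample input this prints the three merged rows in the order Id1, Id2, Id3.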

A regex-based solution using join and GNU sed:

join -t , -a 1 file[12].csv | sed -r '$!N;/^(.*,)(.*)\n\1/!P;s//\n\1\2,/;D'

This assumes file1.csv and file2.csv are both sorted by id and have no header.

file1.csv

1,c11,c12
2,c21,c22
3,c31,c32

file2.csv

1,n11,n12
1,n21,n22
1,n31,n32
2,n41,n42

gives the result
1,c11,c12,n11,n12,n21,n22,n31,n32
2,c21,c22,n41,n42
3,c31,c32
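Because join expects both inputs to be sorted on the join field, unsorted files need a sort step first. Here is a minimal sketch under that assumption. The sample rows, the temporary directory, and the `.sorted` file names are made up for illustration. `LC_ALL=C` is set so that sort and join agree on collation, and `-s` (supported by GNU and BSD sort) keeps the original order of rows that share an id.

```shell
#!/bin/sh
# Hypothetical unsorted sample data in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' 3,c31,c32 1,c11,c12 2,c21,c22 > "$tmp/file1.csv"
printf '%s\n' 2,n41,n42 1,n11,n12 1,n21,n22 > "$tmp/file2.csv"

# Sort each file on the first comma-separated field only.
LC_ALL=C sort -s -t , -k 1,1 "$tmp/file1.csv" > "$tmp/file1.sorted"
LC_ALL=C sort -s -t , -k 1,1 "$tmp/file2.csv" > "$tmp/file2.sorted"

# -a 1 also prints records of file1 that have no match in file2.
joined=$(LC_ALL=C join -t , -a 1 "$tmp/file1.sorted" "$tmp/file2.sorted")

printf '%s\n' "$joined"
rm -rf "$tmp"
```

At this point there is still one output line per matching pair; collapsing the lines that share an id is the job of the sed stage in the pipeline above.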

Update

If file1.csv may contain duplicate IDs with varying numbers of fields, I suggest a pre-processing step to make sure file1.csv is clean before joining it with file2.csv:

awk -F, '{for(i=2;i<=NF;i++) print $1 FS $i}' file1.csv |\
    sort -u |\
    sed -r '$!N;/^(.*,)(.*)\n\1/!P;s//\n\1\2,/;D'
  • The first awk process splits all the data into (id, name) pairs
  • sort -u sorts the pairs and removes duplicates
  • The final sed process merges all pairs with the same ID into one line

Input

1,c11,c12
1,c12,c14,c13
1,c15,c12
2,c21,c22

Output

1,c11,c12,c13,c14,c15
2,c21,c22
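The `-r` flag and the multi-line `N`/`P`/`D` idiom rely on GNU sed. Where that is not available, the final merge stage could be done in awk instead. The following is a sketch of that substitution, not the answerer's method: it keeps the same split and `sort -u` stages and swaps only the last stage. The variable names (`prev`, `line`, `cleaned`) and the temporary directory are assumptions for illustration.

```shell
#!/bin/sh
# Sample data copied from the input above.
tmp=$(mktemp -d)
printf '%s\n' 1,c11,c12 1,c12,c14,c13 1,c15,c12 2,c21,c22 > "$tmp/file1.csv"

cleaned=$(
    # Stage 1: split every record into (id, name) pairs.
    awk -F, '{ for (i = 2; i <= NF; i++) print $1 FS $i }' "$tmp/file1.csv" |
    # Stage 2: sort the pairs and drop duplicates.
    LC_ALL=C sort -u |
    # Stage 3: merge consecutive pairs that share an id into one line.
    awk -F, '
        NR == 1 || $1 != prev {
            if (NR > 1) print line   # flush the previous id
            line = $0
            prev = $1
            next
        }
        { line = line FS $2 }        # same id: append the name
        END { if (NR > 0) print line }
    '
)

printf '%s\n' "$cleaned"
rm -rf "$tmp"
```

Stage 3 depends on the pairs arriving grouped by id, which the preceding sort guarantees.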

Thanks everyone, but I have already got it done. The code I wrote is as follows:

#!/bin/bash

echo
echo 'Merging files into one'

# Split on commas only for each read, rather than changing IFS globally.
while IFS=, read -r id lname fname dnaid status type program startdt enddt ref email dob age add1 add2 city postal phone1 phone2
do
var="$dnaid,$lname,$fname,$status,$type,$program,$startdt,$enddt,$ref,$email,$dob,$age,$add1,$add2,$city,$postal,$phone1,$phone2"

  # Scan file2.csv for every record whose Id matches the current one.
  while IFS=, read -r id2 cwlname cwfname
  do
       if [ "$id" = "$id2" ]
       then
           var="$var,$cwlname,$cwfname"
       fi

  done < file2.csv

  echo "$var" >> /root/scijoinedfile.csv

done < file1.csv

echo
echo "Merging completed"
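The nested loop above re-reads file2.csv once for every record of file1.csv. As a point of comparison, here is a minimal sketch that loads file2.csv into an awk array once and then streams file1.csv in its original order, appending all matches to each record. It is an illustration under assumptions, not the OP's script: the sample rows are the small ones from the join answer, the array name `names` is made up, file1 records are printed unchanged rather than with reordered columns, and fields are split naively on commas.

```shell
#!/bin/sh
# Hypothetical sample data in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' 1,c11,c12 2,c21,c22 3,c31,c32 > "$tmp/file1.csv"
printf '%s\n' 1,n11,n12 1,n21,n22 1,n31,n32 2,n41,n42 > "$tmp/file2.csv"

merged=$(awk -F, '
    # First file named on the command line (file2.csv): collect names per Id.
    # NR == FNR holds only while the first file is read (assumes it is not empty).
    NR == FNR { names[$1] = names[$1] substr($0, length($1) + 1); next }
    # Second file (file1.csv): print each record followed by its matches, if any.
    { print $0 names[$1] }
' "$tmp/file2.csv" "$tmp/file1.csv")

printf '%s\n' "$merged"
rm -rf "$tmp"
```

Records of file1.csv with no match in file2.csv are printed as they are, and neither file needs to be sorted.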