Merge two csv files if Id columns match
I have the following:
file1.csv
"Id","clientName1","clientName2"
file2.csv
"Id","Name1","Name2"
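I want to read file1 sequentially. For each record, I want to check whether there is a matching Id in file2. There may be more than one match. For each match, I want to append Name1, Name2 to the end of the corresponding record in file1.csv.
So if a record has multiple matches in file2, a possible result is: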
"Id","clientName1","clientName2","Name1","Name2","Name1","Name2"
I'm afraid bash may not be an efficient solution, but the following bash script will work:
#!/bin/bash
# Requires bash 4+ for associative arrays.
declare -A id_hash
# Read both files and collect, per Id, every field that follows the Id.
for f in file1.csv file2.csv; do
    while IFS= read -r line; do
        id=${line%%,*}      # first field
        name=${line#*,}     # remaining fields
        if [ -z "${id_hash[$id]}" ]; then
            id_hash[$id]=$name
        else
            id_hash[$id]=${id_hash[$id]},$name
        fi
    done < "$f"
done
# Note: the iteration order of an associative array is unspecified.
for id in "${!id_hash[@]}"; do
    echo "$id,${id_hash[$id]}"
done
In response, here is the revised version of the single awk command. It merges records even when there are duplicated IDs in file1, in file2, or in both, and even when records have different numbers of fields. (The old version works for the OP's question as currently stated.)
awk -F',' '{one=$1;$1="";a[one]=a[one]$0} END{for (i in a) print i""a[i]}' OFS=, file[12]
For the input:
file1
"Id1","clientN1","clientN2"
"Id2","Name3","Name4"
"Id3","client00","client01","client02"
"Id1","client1","client2","client3"
file2
"Id1","Name1","Name2"
"Id1","Name3","Name4"
"Id2","Name0","Name1"
"Id2","Name00","Name11","Name22"
The output merges file1 and file2 on the same IDs:
"Id1","clientN1","clientN2","client1","client2","client3","Name1","Name2","Name3","Name4"
"Id2","Name3","Name4","Name0","Name1","Name00","Name11","Name22"
"Id3","client00","client01","client02"
A regex solution using join and GNU sed:
join -t , -a 1 file[12].csv | sed -r '$!N;/^(.*,)(.*)\n\1/!P;s//\n\1\2,/;D'
Assuming file1.csv and file2.csv are both sorted by id and have no header:
file1.csv
1,c11,c12
2,c21,c22
3,c31,c32
file2.csv
1,n11,n12
1,n21,n22
1,n31,n32
2,n41,n42
gives the result
1,c11,c12,n11,n12,n21,n22,n31,n32
2,c21,c22,n41,n42
3,c31,c32
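join expects both inputs to be sorted on the join field, which is why the answer assumes sorted files. When that is not guaranteed, the inputs can be sorted first. What follows is only a minimal sketch, assuming GNU sed and header-less files with plain comma-separated fields; the function name join_sorted and the temporary files are illustrative and not part of the answer.

```shell
#!/bin/sh
# join_sorted FILE1 FILE2  (hypothetical helper name)
# Sorts both files on the first comma-separated field, left-joins them,
# then folds consecutive rows that share the same leading fields into one row.
join_sorted() {
    tmpdir=$(mktemp -d) || return 1
    # LC_ALL=C makes sort and join agree on the collation order;
    # -s keeps rows with the same Id in their original relative order.
    LC_ALL=C sort -s -t , -k 1,1 "$1" > "$tmpdir/a"
    LC_ALL=C sort -s -t , -k 1,1 "$2" > "$tmpdir/b"
    LC_ALL=C join -t , -a 1 "$tmpdir/a" "$tmpdir/b" |
        sed -r '$!N;/^(.*,)(.*)\n\1/!P;s//\n\1\2,/;D'
    rm -rf "$tmpdir"
}
```

It would be called as join_sorted file1.csv file2.csv and writes the merged rows to standard output, ordered by Id.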
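Update
If file1.csv may contain duplicate IDs and rows with varying numbers of fields, I suggest a pre-processing step to make sure file1.csv is clean before joining it with file2.csv:
awk -F, '{for(i=2;i<=NF;i++) print $1 FS $i}' file1.csv |\
sort -u |\
sed -r '$!N;/^(.*,)(.*)\n\1/!P;s//\n\1\2,/;D'
- The first awk process splits all the data into (id, name) pairs
- sort -u sorts the pairs and removes duplicates
- The last sed process merges all pairs with the same ID into one row
Input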
1,c11,c12
1,c12,c14,c13
1,c15,c12
2,c21,c22
Output
1,c11,c12,c13,c14,c15
2,c21,c22
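Because the pipeline runs the pairs through sort -u, the names come out in sorted order and the original field order is lost. If first-seen order should be kept, the same clean-up can be sketched as a single awk pass. This is an illustrative alternative rather than the answer's method; dedupe_by_id is a hypothetical name, and plain comma-separated fields without quoting are assumed.

```shell
#!/bin/sh
# dedupe_by_id FILE  (hypothetical helper name)
# Collapses rows that share an Id (first field) into one row per Id and
# drops repeated values, keeping Ids and values in first-seen order.
dedupe_by_id() {
    awk -F',' '
        {
            if (!($1 in row)) {          # first time this Id appears
                order[++n] = $1
                row[$1] = $1
            }
            for (i = 2; i <= NF; i++) {
                if (!seen[$1, $i]++)     # skip values already recorded for this Id
                    row[$1] = row[$1] FS $i
            }
        }
        END {
            for (k = 1; k <= n; k++) print row[order[k]]
        }
    ' "$1"
}
```

For the input shown above, this prints 1,c11,c12,c14,c13,c15 for the first Id instead of the sorted 1,c11,c12,c13,c14,c15.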
Thanks everyone, but I've got it done. The code I wrote is as follows:
#!/bin/bash
echo
echo 'Merging files into one'
# IFS is set only for each read, so the rest of the script is unaffected.
while IFS=, read -r id lname fname dnaid status type program startdt enddt ref email dob age add1 add2 city postal phone1 phone2
do
    var="$dnaid,$lname,$fname,$status,$type,$program,$startdt,$enddt,$ref,$email,$dob,$age,$add1,$add2,$city,$postal,$phone1,$phone2"
    # Scan file2.csv for rows with the same Id and append their name fields.
    while IFS=, read -r id2 cwlname cwfname
    do
        if [ "$id" = "$id2" ]
        then
            var="$var,$cwlname,$cwfname"
        fi
    done < file2.csv
    echo "$var" >> /root/scijoinedfile.csv
done < file1.csv
echo
echo "Merging completed"
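The nested loops reread file2.csv once for every record of file1.csv, which gets slow as the files grow. As a rough sketch of an alternative (not code from this thread), a single awk pass can load file2 into memory first and then stream file1 in its original order. The function name merge_by_id is made up for illustration, and the sketch assumes fields contain no embedded commas or newlines.

```shell
#!/bin/sh
# merge_by_id FILE1 FILE2  (hypothetical helper name)
# Prints every row of FILE1 in its original order, followed by the
# non-Id fields of each FILE2 row that has the same Id (first field).
merge_by_id() {
    awk -F',' '
        # FILE2 is passed first: collect its extra fields per Id.
        FILENAME == ARGV[1] {
            # substr keeps the comma that follows the Id
            extra[$1] = extra[$1] substr($0, length($1) + 1)
            next
        }
        # FILE1 is passed second: print the row plus any collected fields.
        { print $0 extra[$1] }
    ' "$2" "$1"
}
```

It would be called as merge_by_id file1.csv file2.csv > joined.csv, where joined.csv is a placeholder output name. Unlike the script above, it keeps all of file1's columns in place rather than reordering them, and rows of file1 that share an Id are printed separately, each with the same appended fields.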