Compare specific columns across multiple files and print matched specific column
I have multiple files in CSV format (six files). I am trying to compare $3, $4, $5 across the files and, if they match, print $6 from every file along with columns $2, $3, $4, $5 from file 1.
Input file 1:
Blink,Seeddensity(g/cm^3),1_0002,VU10,37586764,0.458533399568206
Blink,Seeddensity(g/cm^3),1_0004,VU08,37687622,0.548181169267479
Blink,Seeddensity(g/cm^3),1_0006,VU02,6629660,0.553099787284982
Input file 2:
Farmcpu,Seeddensity(g/cm^3),1_0002,VU10,37586764,0.907010463957269
Farmcpu,Seeddensity(g/cm^3),1_0004,VU08,37687622,0.782521980037194
Farmcpu,Seeddensity(g/cm^3),1_0006,VU02,6629660,0.589126094555234
Input file 3:
GLM,Seeddensity(g/cm^3),1_0002,VU10,37586764,0.24089
GLM,Seeddensity(g/cm^3),1_0004,VU08,37687622,0.25771
GLM,Seeddensity(g/cm^3),1_0006,VU02,6629660,0.31282
Desired output:
Trait Marker Chr Pos Blink Farmcpu GLM
Seeddensity(g/cm^3) 2_27144 VU08 36984438 1.7853934213866E-11 0.907010463957269 0.24089
Seeddensity(g/cm^3) 2_13819 VU08 21705264 3.98653459293212E-09 0.782521980037194 0.25771
Seeddensity(g/cm^3) 2_07286 VU01 38953729 3.16663946775461E-07 0.589126094555234 0.31282
I have looked at several awk commands, but these are the closest I found, and they only do the job across two files:
awk 'NR==FNR{ a[,,,]=; next } { s=SUBSEP; k= s s s }k in a{ print $0,a[k] }' File1 File2 > output
join <(sort File1) <(sort File2) | join - <(sort File3) | join - <(sort File4) | join - <(sort File5) | join - <(sort File6) > output
I think the join does not work because the first column is not the same across the files, so I tried this command:
join -t, -j3 -o 1.2,1.3,1.4,1.5,1.6,2.6,3.6,4.6,5.6,6.6 <(sort -k 3 File1) <(sort -k 3 File2) <(sort -k 3 File3) <(sort -k 3 File4) <(sort -k 3 File5) <(sort -k 3 File6) > output
But I get an error message:
join: invalid file number in field spec: '3.6'
For two files the following command works, but I am not sure how to use it for multiple files:
join -t, -j3 -o 1.2,1.3,1.4,1.5,1.6,2.6 <(sort -k 3 File1) <(sort -k 3 File2) > output
Assuming you do want CSV output, then using GNU awk for ARGIND:
$ cat tst.awk
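BEGIN { FS=OFS="," }
{ key = $3 FS $4 FS $5 }          # the columns being compared
ARGIND < (ARGC-1) {               # every file except the last one on the command line
    val[key,ARGIND] = $6
    next
}
{                                 # last file: print only if the key was seen in all earlier files
    sfx = ""
    for (i=1; i<ARGIND; i++) {
        if ( (key,i) in val ) {
            sfx = sfx OFS val[key,i]
        }
        else {
            next
        }
    }
    print $2, $3, $4, $5, $6 sfx
}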
$ awk -f tst.awk file2 file3 file1
Seeddensity(g/cm^3),1_0002,VU10,37586764,0.458533399568206,0.907010463957269,0.24089
Seeddensity(g/cm^3),1_0004,VU08,37687622,0.548181169267479,0.782521980037194,0.25771
Seeddensity(g/cm^3),1_0006,VU02,6629660,0.553099787284982,0.589126094555234,0.31282
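If you also want the header line and the space-separated layout shown in the desired output of the question, one possible variation is sketched here. It is not part of the script above. The function name merge_with_header is made up for illustration, and the sketch assumes that $1 of each file is the method name to use as the column heading, that each $3,$4,$5 combination occurs at most once per file, and that all files fit in memory. Files are passed in the order the value columns should appear.

```shell
#!/bin/sh
# merge_with_header: hypothetical helper name, not from the original answer.
# Usage: merge_with_header file1 file2 file3 ...
merge_with_header() {
    awk -F, '
    FNR == 1 { n++; name[n] = $1 }      # n = index of the current file; $1 is assumed to be the method name
    {
        key = $3 SUBSEP $4 SUBSEP $5    # the columns being compared
        val[key, n] = $6
        if (n == 1) {                   # remember row order and leading columns from the first file
            order[++rows] = key
            lead[key] = $2 " " $3 " " $4 " " $5
        }
    }
    END {
        hdr = "Trait Marker Chr Pos"
        for (i = 1; i <= n; i++) hdr = hdr " " name[i]
        print hdr
        for (r = 1; r <= rows; r++) {
            key = order[r]; line = lead[key]; ok = 1
            for (i = 1; i <= n; i++) {
                if ((key, i) in val) { line = line " " val[key, i] }
                else { ok = 0; break }  # key missing from some file: skip this row
            }
            if (ok) print line
        }
    }' "$@"
}
```

Because everything is printed in the END block, the argument order here is simply file1 file2 file3, unlike tst.awk, which expects file1 last.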
For any other awk, just add the line FNR==1 { ARGIND++ } at the start of tst.awk.
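As a rough sketch of that portable variant, here is the same logic wrapped in a shell function. The function name merge_cols and the counter name fidx are made up for illustration. The counter is deliberately not called ARGIND: gawk sets ARGIND itself whenever it opens a file, so incrementing it by hand there would throw the count off, whereas a plain variable behaves the same under any awk. This assumes none of the input files is empty, since FNR==1 never fires for an empty file.

```shell
#!/bin/sh
# merge_cols: hypothetical helper name. The LAST file argument supplies columns 2-5,
# exactly as with "awk -f tst.awk file2 file3 file1".
merge_cols() {
    awk '
    BEGIN { FS = OFS = "," }
    FNR == 1 { fidx++ }                 # count input files by hand (stand-in for gawk ARGIND)
    { key = $3 FS $4 FS $5 }            # the columns being compared
    fidx < (ARGC - 1) {                 # every file except the last: remember its $6
        val[key, fidx] = $6
        next
    }
    {                                   # last file: print only if the key was in all earlier files
        sfx = ""
        for (i = 1; i < fidx; i++) {
            if ((key, i) in val) {
                sfx = sfx OFS val[key, i]
            }
            else {
                next
            }
        }
        print $2, $3, $4, $5, $6 sfx
    }' "$@"
}
```

Calling merge_cols file2 file3 file1 should then produce the same three CSV lines as the gawk run above.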