Compare two rows and print only one if a pattern repeats between two columns in any order
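This should be fairly simple with awk (hopefully), but I can't find a solution.
I have a file in which I want to compare every line against every other line. If the combination of strings in columns 1 and 2 is repeated in any other line, in either order, I only want to print the first match: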
cat file.csv
alpha_3,alpha_47,100,60,0,0,1,60,1,60,8.21E-29,111
alpha_47,alpha_3,100,60,0,0,1,60,1,60,8.21E-29,111
beta_86,beta_12,100,61,0,0,1,61,1,61,2.33E-29,113
beta_86,beta_14,100,61,0,0,1,61,1,61,2.33E-29,113
beta_12,beta_14,100,61,0,0,1,61,1,61,2.33E-29,113
beta_14,beta_12,100,61,0,0,1,61,1,61,2.33E-29,113
#command
This seems to be working but I have to extract the first two columns,
and I can't print the first instance of the match
awk -F "," '{print $1, $2}' file.csv | awk -F' ' '!seen[$2 FS $1]; {seen[$0]++}'
alpha_3 alpha_47
beta_86 beta_12
beta_86 beta_14
beta_12 beta_14
But it doesn't print the whole line and if I try without selecting the first two columns it doesn't work.
#desired output
alpha_3,alpha_47,100,60,0,0,1,60,1,60,8.21E-29,111
beta_86,beta_12,100,61,0,0,1,61,1,61,2.33E-29,113
beta_86,beta_14,100,61,0,0,1,61,1,61,2.33E-29,113
beta_12,beta_14,100,61,0,0,1,61,1,61,2.33E-29,113
I'm (still) learning awk, so it would be even better if someone could provide a solution and explain their code!
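When you want to compare composite values regardless of order, the general solution is to sort the keys used to create the array index. Given just 2 keys, that reduces to comparing them and always concatenating them in the same order (e.g. greatest first), regardless of their input order: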
$ awk -F, '!seen[$1>$2 ? $1 FS $2 : $2 FS $1]++' file.csv
alpha_3,alpha_47,100,60,0,0,1,60,1,60,8.21E-29,111
beta_86,beta_12,100,61,0,0,1,61,1,61,2.33E-29,113
beta_86,beta_14,100,61,0,0,1,61,1,61,2.33E-29,113
beta_12,beta_14,100,61,0,0,1,61,1,61,2.33E-29,113
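
Since an explanation was asked for, here is the same idea spelled out as a longer script. Treat it as a sketch rather than the definitive form: the wrapper name `dedup_unordered` and the variable `key` are made up for illustration and do not appear in the one-liner above.

```shell
# dedup_unordered is a hypothetical wrapper name, used here only for illustration.
# It reads comma-separated input (files given as arguments, or stdin) and
# prints a line only the first time its unordered {col1, col2} pair is seen.
dedup_unordered() {
  awk -F, '
    {
      # Build a key that is the same whichever way round the two fields appear:
      # always put the greater of the two first.
      key = ($1 > $2) ? $1 FS $2 : $2 FS $1
    }
    # seen[key]++ evaluates to 0 (false) the first time a key occurs and to a
    # positive count afterwards, so !seen[key]++ is true only for the first
    # occurrence. A true pattern with no action prints the whole line ($0).
    !seen[key]++
  ' "$@"
}
```

Called as `dedup_unordered file.csv`, this should print the same four lines as the one-liner. Because the key is built from the fields while the pattern prints `$0`, there is no need to cut out the first two columns beforehand, which is what caused the original pipeline to lose the rest of each line.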