How to eliminate duplicates among AWK search patterns using AWK and/or SED
I have the following file named x.txt (excerpt only):
exMap( "0Ba|Mtm|Mtm","Variable Expenses","Accounting & Legal" )
exMap( "gn C[hu]|gn C[hu]|ent Ca","Variable Expenses","Bank – Charges" )
exMap( "t m|e Fee|^Deb|A\/C|hly pr|ged Ov|^Visa","Fixed Expenses","Bank – Charges" )
exMap( "ATM|ATM|ATM|ATM|ATM|ATM|ATM|ATM|ATM|ATM|^Fix C|R US","Variable Expenses","Bank – Withdrawals" )
exMap( "Acci","Variable Expenses","Business – ACC" )
exMap( "use St$|Pgg","Variable Expenses","Business – Miscellaneous" )
exMap( "utd$|^Ellm|^Ellm|a Cy|^Stihl|^Stihl|a Mow","Variable Expenses","Business – Repairs & Maintenance" )
exMap( "Nzp","Fixed Expenses","Business Services – Mail" )
exMap( "0Ki|^K S","Fixed Expenses","Business – Storage" )
exMap( "2 C|2 C|40D|Tor|Tor|Tor|e of B|^Jay|^Luk|ll J|ty AP|le Au|eemi|eemi|eemi|^N[Zz] S|^N[Zz] S|s The J|P[Bb] T|P[Bb] T|P[Bb] T|mTo|deme|deme|deme|deme|deme|deme","Variable Expenses","Capex" )
exMap( "90C","Variable Expenses","Christian to Crewcut" )
exMap( "10E","Variable Expenses","Christian to Edorne" )
exMap( "0.0Ch|0.0Ch","Variable Expenses","Edorne to Christian" )
exMap( "lyt|-J|lyt|y Ha|NZA|^NZ T|^NZ T|^NZ T|^NZ T|^NZ T|^NZ T|^NZ T|acle Boo","Variable Expenses","Education" )
exMap( "weat|Aca|weat|Aca|weat|^A A|^A A|^A A|^A A|^A A|^A A|^A A|weat","Fixed Expenses","Education" )
exMap( "ntac|0Tr|ntac","Fixed Expenses","Electricity" )
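The file is comma-separated and has three columns. The first column contains awk regex search patterns, separated by pipes. Some of them are duplicated, e.g. |Mtm| in the first line or |ATM| in line 4. Is there a clever way to eliminate the duplicates throughout the file with awk and/or sed while keeping the pipe structure intact?
The desired output for the first and fourth lines would be: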
exMap( "0Ba|Mtm","Variable Expenses","Accounting & Legal" )
exMap( "ATM|^Fix C|R US","Variable Expenses","Bank – Withdrawals" )
Using sed
$ cat rem_dupes.sed
s/\(|[^|]*|\?\"\?\)\+//g
s/\(|\?[^|]*|\"\?\)\+//g
s/\(\([a-z][^|]*\)|[^"]*\)|//g
s/\(\([a-z0-9][^|]*\)|\?[^|]*\)|//
s/\(\([a-z]*|\)[^|]*|\)//
$ sed -f rem_dupes.sed input_file
exMap( "0Ba|Mtm","Variable Expenses","Accounting & Legal" )
exMap( "gn C[hu]|ent Ca","Variable Expenses","Bank – Charges" )
exMap( "t m|e Fee|^Deb|A\/C|hly pr|ged Ov|^Visa","Fixed Expenses","Bank – Charges" )
exMap( "ATM|^Fix C|R US","Variable Expenses","Bank – Withdrawals" )
exMap( "Acci","Variable Expenses","Business – ACC" )
exMap( "use St$|Pgg","Variable Expenses","Business – Miscellaneous" )
exMap( "utd$|^Ellm|a Cy|^Stihl|a Mow","Variable Expenses","Business – Repairs & Maintenance" )
exMap( "Nzp","Fixed Expenses","Business Services – Mail" )
exMap( "0Ki|^K S","Fixed Expenses","Business – Storage" )
exMap( "2 C|40D|Tor|e of B|^Jay|^Luk|ll J|ty AP|le Au|eemi|^N[Zz] S|s The J|P[Bb] T|mTo|deme","Variable Expenses","Capex" )
exMap( "90C","Variable Expenses","Christian to Crewcut" )
exMap( "10E","Variable Expenses","Christian to Edorne" )
exMap( "0.0Ch","Variable Expenses","Edorne to Christian" )
exMap( "lyt|-J|y Ha|NZA|^NZ Tcle Boo","Variable Expenses","Education" )
exMap( "weat|Aca|^A A","Fixed Expenses","Education" )
exMap( "ntac|0Tr","Fixed Expenses","Electricity" )
s/\(|[^|]*|\?\"\?\)\+//g
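- Using grouping and backreferences, capture in a group everything that follows a pipe | up to the next pipe. The first pipe must be present; the second pipe is optional (\?), and so is a trailing double quote ". The group is then required to repeat immediately after itself. When such a repeated match is found, the repeats are dropped and only the original match is kept, by returning it as a backreference in the replacement.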
s/\(|\?[^|]*|\"\?\)\+//g
- As above, using grouping and backreferences, but this time the initial pipe | is optional while the second pipe must be present.
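The explanation describes a backreference in both the pattern and the replacement, while the commands as printed show an empty replacement. The following is a minimal sketch of the described idea only, not the answer's exact command: the helper name collapse_runs and the shortened sample row are assumptions made for illustration. It captures one alternative together with its trailing pipe and collapses an immediately repeated run of it to a single copy.

```shell
# Hypothetical helper: collapse runs of identical alternatives that are
# each followed by a pipe.
#   \([^|"]*|\)  captures one alternative plus its trailing pipe
#   \1\{1,\}     requires one or more immediate repeats of that capture
#   /\1/         keeps a single copy
collapse_runs() {
  sed 's/\([^|"]*|\)\1\{1,\}/\1/g'
}

# Shortened version of the Capex row from x.txt.
printf '%s\n' 'exMap( "2 C|2 C|40D|Tor|Tor|Tor|e of B","Variable Expenses","Capex" )' |
  collapse_runs
```

This form has limits, which may be why the script needs several commands: a duplicate that ends the list (the second Mtm in 0Ba|Mtm|Mtm) has no trailing pipe and is left alone, and an alternative that is a suffix of its left neighbour could be removed wrongly.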
s/\(\([a-z][^|]*\)|[^"]*\)|//g
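- Nested grouping is used here to match a specific pattern inside the outer group. This lets us match duplicates that occur in an interleaved sequence, such as lyt|-J|lyt|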
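Interleaved duplicates can also be handled with a substitution loop. The sketch below is an alternative formulation, not the command from the script: the function name dedupe_loop is hypothetical, and it assumes that pipes occur only inside the first quoted column and that no pattern contains a double quote. Each pass removes one later copy of an alternative, and the t command repeats the substitution until nothing changes.

```shell
# Hypothetical helper: remove later copies of an alternative, whether
# adjacent or interleaved, one per pass.
#   \1  delimiter before the first copy (opening quote or pipe)
#   \2  the alternative itself
#   \3  zero or more alternatives in between (\4 is its inner group)
#   \5  delimiter after the later copy (pipe or closing quote)
dedupe_loop() {
  sed -e ':a' \
      -e 's/\(["|]\)\([^|"]*\)\(\(|[^|"]*\)*\)|\2\(["|]\)/\1\2\3\5/' \
      -e 'ta'
}

printf '%s\n' 'exMap( "ntac|0Tr|ntac","Fixed Expenses","Electricity" )' | dedupe_loop
# prints: exMap( "ntac|0Tr","Fixed Expenses","Electricity" )
```

Because the alternative is delimited on both sides, only whole alternatives are compared, so a pattern that merely ends with another pattern is not touched.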
s/\(\([a-z0-9][^|]*\)|\?[^|]*\)|//
- Same as above, but this also targets interleaved duplicates that start with a digit.
s/\(\([a-z]*|\)[^|]*|\)//
- This cleans up the final interleaved duplicates found using nested grouping.
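Since the question also allows awk, here is a hedged alternative that avoids regex matching on the alternatives altogether. In the sed output shown above, the Education row ends in ^NZ Tcle Boo, which looks as if ^NZ T and acle Boo were merged. The sketch below instead splits each line on double quotes, splits the first quoted column on pipes, and keeps the first occurrence of each alternative. The function name dedupe_patterns is an assumption, and the code assumes the patterns never contain a double quote.

```shell
# Hypothetical helper: de-duplicate the pipe-separated alternatives in
# the first quoted column of every line, keeping first occurrences.
dedupe_patterns() {
  awk '
  BEGIN { FS = OFS = "\"" }            # fields are separated by double quotes
  NF >= 3 {
      n = split($2, alt, "|")          # $2 is the first quoted column
      split("", seen)                  # portable way to empty the array
      out = ""; k = 0
      for (i = 1; i <= n; i++) {
          if (!(alt[i] in seen)) {     # first time this alternative appears
              seen[alt[i]] = 1
              out = (k++ ? out "|" : "") alt[i]
          }
      }
      $2 = out                         # record is rebuilt with OFS
  }
  { print }
  ' "$@"
}

printf '%s\n' 'exMap( "lyt|-J|lyt|y Ha|NZA|^NZ T|^NZ T|acle Boo","Variable Expenses","Education" )' |
  dedupe_patterns
# prints: exMap( "lyt|-J|y Ha|NZA|^NZ T|acle Boo","Variable Expenses","Education" )
```

Called with a file name, as in dedupe_patterns x.txt, it processes the whole file. The alternatives are compared as plain strings, so regex characters such as ^, $ or [hu] inside them need no escaping.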