sed split substring that may or may not be present
I am trying to split a text column on a substring that may or may not be present.
Sample file:
TEXT1D1NEWBWP210HTEXT2
TEXT1D1BWP210HTEXT2
Expected output:
TEXT1D1NEWBWP210HTEXT2 NEWBWP 210H
TEXT1D1BWP210HTEXT2 BWP 210H
Command used. I expected "\?" to check whether the substring "NEW" is present and to print it if it is.
cat <text_file> | sed -e 's/.*\(\s*\)\(NEW\)\?\(BWP\)\([0-9]\+\)H.*/\0 \2\3 \4H/'
The output of the above command is:
TEXT1D1NEWBWP210HTEXT2 BWP 210H
TEXT1D1BWP210HTEXT2 BWP 210H
Not sure what I am doing wrong here... :)
Using sed:
$ sed 's/\(\(NEW\)\?BWP\)\([^A-Z]*.\).*/& \1 \3/' input_file
TEXT1D1NEWBWP210HTEXT2 NEWBWP 210H
TEXT1D1BWP210HTEXT2 BWP 210H
what I am doing wrong here
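Well, this is a tricky one, and it comes down to how regular expressions work. A regex is matched from left to right, and .* is greedy: it first matches everything it can, here the whole line. Only when the rest of the pattern fails does the engine back up, giving characters back one at a time "from the end" of what .* matched. Backing up stops at the first point where the rest of the pattern can match, which is as soon as BWP is found. At that point .* still holds NEW, and because \(NEW\)\? is allowed to match nothing, the engine never needs to give NEW back. So NEW is never captured.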
.*\(\s*\)\(NEW\)\?\(BWP\)\([0-9]\+\)H.*
Events:
^^                                      - matches everything
    ^^^                                 - matches nothing (end of string)
          ^^^^^^^^                      - matches nothing (end of string)
                    ^^^                 - engine is at the end of string,
                                          so it goes back until BWP is matched
^^                                      - matches 'TEXT1D1NEWBWP210HTEXT' (from the back)
                    ^^^                 - does not match
^^                                      - matches 'TEXT1D1NEWBWP210HTEX' (from the back)
                    ^^^                 - does not match
^^                  ^^^                 - etc. for each character from the end
^^                                      - matches 'TEXT1D1NEWB' (from the back)
                    ^^^                 - does not match
^^                                      - matches 'TEXT1D1NEW' (from the back)
                    ^^^                 - matches 'BWP'
                         ^^^..          - regex engine continues
You can read more at https://www.regular-expressions.info/repeat.html#lazy .
Anyway, you have to program it:
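To see the end state of that backtracking directly, here is a minimal sketch that prints what each group captured. It uses the -E (extended regex) spelling of the pattern from the question and puts the leading .* in a group of its own; the function name show_groups and the lead=/new=/bwp=/num= labels are made up for illustration.

```shell
# Print what the leading .* and the optional NEW group end up capturing.
# Assumes a sed that supports -E (GNU or BSD).
show_groups() {
  printf '%s\n' "$1" |
    sed -E 's/(.*)(NEW)?(BWP)([0-9]+)H.*/lead=[\1] new=[\2] bwp=[\3] num=[\4]/'
}

show_groups TEXT1D1NEWBWP210HTEXT2
show_groups TEXT1D1BWP210HTEXT2
```

For the first line the lead group holds TEXT1D1NEW and the new group is empty, which is exactly why NEW is missing from the output in the question.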
$ sed -e '
s/.*\s*\(NEWBWP\)\([0-9]\+H\).*/\0 \1 \2/;
t a; # if the above s was successful, go to a
s/.*\s*\(BWP\)\([0-9]\+H\).*/\0 \1 \2/;
: a;
' <<<$'TEXT1D1NEWBWP210HTEXT2\nTEXT1D1BWP210HTEXT2'
TEXT1D1NEWBWP210HTEXT2 NEWBWP 210H
TEXT1D1BWP210HTEXT2 BWP 210H
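This might work for you (GNU sed):
sed -E 's/(NEWBWP|BWP)([0-9]+H).*/& \1 \2/' file
The alternation | is tried from left to right, so if NEWBWP does not match, BWP is tried instead. Because there is no leading .* here, the match starts at the leftmost position where either alternative fits, so NEW is included whenever it is present.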
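As a quick check that the missing leading .* is what keeps NEW in the match, the sketch below swaps the two alternatives and still gets the same result. It assumes the replacement is & \1 \2 (whole match, then the two groups); the function name swapped is made up for illustration.

```shell
# Alternatives swapped on purpose: BWP is listed before NEWBWP.
# The match still starts at the leftmost possible position (the N of NEW),
# and only the NEWBWP alternative can match there.
swapped() {
  sed -E 's/(BWP|NEWBWP)([0-9]+H).*/& \1 \2/'
}

printf '%s\n' TEXT1D1NEWBWP210HTEXT2 TEXT1D1BWP210HTEXT2 | swapped
```

Both input lines come out with the expected NEWBWP 210H and BWP 210H suffixes, so listing NEWBWP first is a readable convention here rather than a requirement.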
Using -E with GNU or BSD sed means you don't need all those backslashes (you are already relying on GNU sed for \s):
$ sed -Ee 's/((NEW)?BWP)([0-9]+)H.*/& \1 \3H/' file
TEXT1D1NEWBWP210HTEXT2 NEWBWP 210H
TEXT1D1BWP210HTEXT2 BWP 210H
The main problem with your regex is that the initial .* consumes the optional NEW (if present).
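If neither -E nor the GNU-only \? and \+ are available, the same idea can be sketched in plain basic regular expressions using interval expressions. This is only one possible spelling, not something taken from the answers above; the function name split_bwp is made up for illustration.

```shell
# Same approach without a leading .*, using only POSIX BRE syntax:
# \{0,1\} stands in for \? and \{1,\} stands in for \+.
split_bwp() {
  sed 's/\(\(NEW\)\{0,1\}BWP\)\([0-9]\{1,\}\)H.*/& \1 \3H/'
}

printf '%s\n' TEXT1D1NEWBWP210HTEXT2 TEXT1D1BWP210HTEXT2 | split_bwp
```

The & in the replacement is the matched tail of the line (from NEWBWP or BWP onward); the unmatched prefix TEXT1D1 is left in place by sed, so the original text is reproduced in full before the two new columns.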