Eliminate all URLs by replacing them with the generic word URL using regular expression matching

I am new to regular expression matching. Suppose I want to find all URLs in a comma-separated text file and replace them with the word "url".

user,user,' http://twitpic.com/2y1zl - awww, that\'s a bummer.    you shoulda got david carr of third day to do it. ;d',0   
user,user,'is upset that he can\'t update his facebook by texting it... and might cry as a result  school today also. blah!',0   
user,user,' i dived many times for the ball. http://twitpic.com/2y1zl managed to save 50\%  the rest go out of bounds',0  
user,user,'my whole body feels itchy and like its on fire ',0  
user,user,' no, it\'s not behaving at all. i\'m mad. why am i here? because i can\'t see you all over there. ',0  
user,user,' not the whole crew ',0   
user,user,'need a hug ',0   
user,user,' hey  long time no see! yes.. rains a bit ,only a bit  lol , i\'m fine thanks , how\'s you ?',0    
user,user,'_k nope they didn\'t have it ',0   
user,user,'que me muera ? ',0   
user,user,'spring break in plain city... it\'s snowing ',0  
user,user,'i just re-pierced my ears ',0   

I would like to get output like this:

user,user,' *url*- awww, that\'s a bummer.    you shoulda got david carr of third day to do it. ;d',0   
user,user,'is upset that he can\'t update his facebook by texting it... and might cry as a result  school today also. blah!',0   
user,user,' i dived many times for the ball. *url* managed to save 50\%  the rest go out of bounds',0  
user,user,'my whole body feels itchy and like its on fire ',0  
user,user,' no, it\'s not behaving at all. i\'m mad. why am i here? because i can\'t see you all over there. ',0  
user,user,' not the whole crew ',0   
user,user,'need a hug ',0   
user,user,' hey  long time no see! yes.. rains a bit ,only a bit  lol , i\'m fine thanks , how\'s you ?',0    
user,user,'nope they didn\'t have it ',0   
user,user,'que me muera ? ',0   
user,user,'spring break in plain city... it\'s snowing ',0  
user,user,'i just re-pierced my ears ',0   

I tried sed:

sed -e 's/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$//URL/' filename.txt  |less

but the find-and-replace regular expression does not work.

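By default, GNU sed uses basic regular expressions, which need a lot of backslashes (see: https://www.gnu.org/software/gnulib/manual/html_node/Regular-expression-syntaxes.html#Regular-expression-syntaxes). In addition, sed regular expressions do not understand Perl's \d (GNU sed accepts \w as an extension, but it is not portable to other seds).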
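To make the backslash point concrete, here is a minimal sketch that writes the same substitution in both syntaxes. The sample line is a shortened, hypothetical version of the question's data, and the character class [0-9a-z./-] is only a simplified stand-in for "URL characters". It assumes GNU sed, since \? and \+ in basic syntax are GNU extensions.

```shell
# Hypothetical sample line, modeled on the question's data.
line="user,user,' http://twitpic.com/2y1zl - awww',0"

# Basic syntax (default): ? and + need a backslash in GNU sed,
# and [0-9] stands in for Perl's \d.
bre_out=$(printf '%s\n' "$line" | sed 's@https\?://[0-9a-z./-]\+@*url*@g')

# Extended syntax (-E): the same operators are written without backslashes.
ere_out=$(printf '%s\n' "$line" | sed -E 's@https?://[0-9a-z./-]+@*url*@g')

printf '%s\n%s\n' "$bre_out" "$ere_out"
```

Both commands should print the line with the URL replaced by *url*.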

Matching URLs is a fairly hard problem. Start with:

sed  's@https\?://[^[:blank:]]\+@*url*@g' file

This uses an alternate delimiter for the s/// command, which avoids having to escape the slashes.
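As a rough illustration of the delimiter choice, the sketch below runs the same substitution twice, once with the default / delimiter and once with @. The input is a shortened version of one line from the question, and GNU sed is assumed.

```shell
# Shortened version of a line from the question's sample data.
line="user,user,' i dived for the ball. http://twitpic.com/2y1zl managed to save',0"

# Default / delimiter: every slash in the pattern must be escaped.
with_slash=$(printf '%s\n' "$line" | sed 's/https\?:\/\/[^[:blank:]]\+/*url*/g')

# Alternate @ delimiter: the slashes in the pattern stay readable.
with_at=$(printf '%s\n' "$line" | sed 's@https\?://[^[:blank:]]\+@*url*@g')

printf '%s\n%s\n' "$with_slash" "$with_at"
```

The two results should be identical; only the readability of the command differs.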

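If your URLs are separated from whatever follows them by a space, or by anything else that cannot appear in a URL, this should work.

I am not dealing here with non-http URLs or user/password combinations; just http/https followed by a sequence of characters that are allowed in a URL.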

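sed -e 's|https\?://[][0-9A-Za-z._~:/?#@!$&()*+,;=%'\''-]\+|URL|g' file
  • I use | as the delimiter to simplify the handling of slashes. (@ is itself in the character class, and GNU sed ends the pattern at the first unescaped delimiter even inside brackets, so @ cannot be the delimiter here.)
  • Letters are written as A-Za-z; a range such as a-Z is rejected as invalid in the C locale.
  • Since square brackets and the dash are allowed, I put them directly at the start and at the end of the character class, respectively.
  • To match a single quote, the shell string has to be closed, a backslash-escaped quote inserted, and the string reopened, so it ends up as: '\''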
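The difference between "anything that is not blank" and an explicit class of URL characters shows up when a URL is followed directly by punctuation. The sketch below uses a made-up input line with a URL in angle brackets. It assumes GNU sed, uses | as the delimiter (because @ is inside the class), and spells letters as A-Za-z.

```shell
# Hypothetical input: a URL wrapped in angle brackets, with no space after it.
line='see <http://example.com/a?b=1> for details'

# Non-blank version: the closing > is swallowed together with the URL.
loose=$(printf '%s\n' "$line" | sed 's@https\?://[^[:blank:]]\+@URL@g')

# Character-class version: > is not in the class, so the match stops before it.
strict=$(printf '%s\n' "$line" | sed 's|https\?://[][0-9A-Za-z._~:/?#@!$&()*+,;=%'\''-]\+|URL|g')

printf '%s\n%s\n' "$loose" "$strict"
```

With this input the first command should leave "see <URL for details" and the second "see <URL> for details". Note that the class still contains the single quote and the comma, so a URL followed directly by either of those would absorb it.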