How to remove spaces between tags in a delimited file?

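I have this table dumped from a MySQL system. Although it follows the RFC standard, it seems to add unwanted whitespace in the column where HTML text is stored. For example: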

   "2000","Something","Something,"Something","Something","Something","2017-11-15 15:12:51","115060","Something","Something","Something","Something","","Something","Something","Something","Tabuk","TKPR","999","Something","Something","103984","Something","Something","UTC+03:00","sameday","15","100","3","1443","1","Something","3","Something","<div style=""margin:1em;"">"
       <div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">"

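This is one of roughly 30K rows, so I'd like to find a smart way to remove the whitespace between " and <div (and possibly others).

awk '{$1=$1;printf $0}'

also sort of works, but it mashes everything into a single line, which is not what I want. I want to keep the line breaks of the CSV dump. I'd love to hear your thoughts on how to solve this.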

You can use perl:

perl -0777 -i -pe 's/"\K\s+(?=<div)//g' file

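Details

  • -0777 slurps the file into a single string so that the pattern can match across line breaks
  • -i - enables in-place file modification
  • "\K\s+(?=<div) - matches a " character that \K then drops from the match value, then consumes one or more whitespace characters (with \s+), and <div must follow immediately; the match is replaced with an empty string
  • g replaces all occurrences.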

You can achieve the same with GNU sed:

sed -i -Ez 's/"\s+<div/"<div/g' file

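where -i enables in-place file replacement, -E enables POSIX ERE regex syntax, and -z pulls the whole file text into the pattern space, where line breaks are "visible" to the regex pattern.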
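
To see why -z matters here, a small comparison can help. This is a sketch that assumes GNU sed and uses a made-up two-line sample; it counts the output lines with and without -z (and without -i, so no file is changed).

```shell
# Hypothetical sample: one record whose last field continues on a second line.
demo=$(mktemp)
printf '%s\n' '"x","<div>"' '   <div lang=""en"">"' > "$demo"

# No -z: sed works line by line, so the " and the <div never share a pattern space.
lines_without_z=$(sed -E 's/"\s+<div/"<div/g' "$demo" | wc -l)

# With -z: the input is split on NUL bytes, so the whole file is one "line"
# and the match can cross the line break.
lines_with_z=$(sed -Ez 's/"\s+<div/"<div/g' "$demo" | wc -l)

echo "without -z: $lines_without_z line(s); with -z: $lines_with_z line(s)"
rm -f "$demo"
```

Without -z the two lines come out unchanged; with -z they are joined into one.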

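The following uses GNU awk for multi-char RS, RT, and gensub(). It will work even if your input file is huge, because it doesn't read the whole file into memory; it only reads one string at a time, delimited by "<spaces>< or a newline: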

$ awk -v RS='"\s+<|\n' '{printf "%s%s", $0, gensub(/"\s+</,"\"<",1,RT)}' file
   "2000","Something","Something,"Something","Something","Something","2017-11-15 15:12:51","115060","Something","Something","Something","Something","","Something","Something","Something","Tabuk","TKPR","999","Something","Something","103984","Something","Something","UTC+03:00","sameday","15","100","3","1443","1","Something","3","Something","<div style=""margin:1em;"">"<div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">"

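I'm assuming that when you say "and possibly others" in your question you mean other cases like "<spaces><div>, where there is a ", then spaces, then a tag starting with <, but that's obviously just a guess.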

Assuming your requirement is to remove the whitespace before the start of the <div tag, you can try this GNU sed:

$ sed -z 's/\(\"\)[[:space:]]\+\(<div .*\)/\1\n\2/' input_file
   "2000","Something","Something,"Something","Something","Something","2017-11-15 15:12:51","115060","Something","Something","Something","Something","","Something","Something","Something","Tabuk","TKPR","999","Something","Something","103984","Something","Something","UTC+03:00","sameday","15","100","3","1443","1","Something","3","Something","<div style=""margin:1em;"">"
<div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">"

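With your shown samples only, please try the following awk code. Written and tested in GNU awk. Simple explanation: set RS (the record separator) to null, and in the main program globally substitute newlines followed by whitespace followed by <div with just <div, then print the record the awkish way by using 1.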

awk -v RS="" '{gsub(/\n+[[:space:]]+<div/,"<div")} 1' Input_file