How to remove spaces between tags in a delimited file?
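I have this table dumped from a MySQL system, and although it follows the RFC standard, it seems to add unwanted whitespace in the column where HTML text is stored. For example: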
"2000","Something","Something,"Something","Something","Something","2017-11-15 15:12:51","115060","Something","Something","Something","Something","","Something","Something","Something","Tabuk","TKPR","999","Something","Something","103984","Something","Something","UTC+03:00","sameday","15","100","3","1443","1","Something","3","Something","<div style=""margin:1em;"">"
<div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">"
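This is one of about 30K lines, so I'd like to find a smart way to remove the whitespace between the " and the <div that follows it (and possibly others). I tried
awk '{$1=$1;printf $0}'
and it kind of works, but it mashes everything into a single line, which is not what I want. I want to keep the line breaks of the CSV dump. I'd love to hear your ideas on how to solve this.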
You can use perl:
perl -0777 -i -pe 's/"\K\s+(?=<div)//g' file
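Details:
-0777 - slurps the whole file into a single string so that the pattern can match across line breaks
-i - turns on in-place editing of the file
"\K\s+(?=<div) - matches a " character, which \K then drops from the match value, then consumes one or more whitespace characters (\s+) that must be immediately followed by <div; the match is replaced with an empty string
g - replaces all occurrences.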
You can achieve the same with GNU sed:
sed -i -Ez 's/"\s+<div/"<div/g' file
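Here, -i enables in-place file editing, -E enables POSIX ERE regex syntax, and -z makes sed split its input on NUL bytes instead of newlines, so the whole file is pulled into the pattern space, where the line breaks are "visible" to the regex pattern.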
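Since -i overwrites the dump, one cautious variant is to give -i a suffix so GNU sed keeps a backup, and then compare line counts before and after. This is a sketch under assumptions: the sample file and the variable names (sample, lines_before, lines_after, sed_out) are hypothetical, and the .bak suffix is just a common choice.

```shell
# Hypothetical sample file: the second line is an indented continuation of
# the HTML column that starts on the first line.
sample=$(mktemp)
printf '%s\n' \
  '"1","a","<div style=""margin:1em;"">"' \
  '   <div lang=""en"">"' \
  '"2","b","plain"' > "$sample"

lines_before=$(grep -c '' "$sample")

# -i.bak edits in place but first saves the original as "$sample.bak".
sed -i.bak -Ez 's/"\s+<div/"<div/g' "$sample"

lines_after=$(grep -c '' "$sample")
sed_out=$(cat "$sample")

echo "lines: $lines_before -> $lines_after"
# diff exits with status 1 when the files differ, so ignore its status here.
diff "$sample.bak" "$sample" || true

rm -f "$sample" "$sample.bak"
```

On the sample, the line count drops from 3 to 2 because only the continuation line is joined; on a real dump, the difference should equal the number of broken HTML lines you expect to repair.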
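The following uses GNU awk for multi-char RS, RT, and gensub(). It works even if your input file is huge, since it doesn't read the whole file into memory; it just reads one string at a time, each terminated by "<spaces>< or a newline:
$ awk -v RS='"\s+<|\n' '{printf "%s%s", $0, gensub(/"\s+</,"\"<",1,RT)}' file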
"2000","Something","Something,"Something","Something","Something","2017-11-15 15:12:51","115060","Something","Something","Something","Something","","Something","Something","Something","Tabuk","TKPR","999","Something","Something","103984","Something","Something","UTC+03:00","sameday","15","100","3","1443","1","Something","3","Something","<div style=""margin:1em;"">"<div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">"
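I'm assuming that when you said "and possibly others" in your question, you meant other cases like "<spaces><div>, where there is a ", then spaces, then a tag starting with <, but that is obviously just a guess.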
Assuming your requirement is to remove the whitespace before the start of the <div tag, you can try this GNU sed:
$ sed -z 's/\(\"\)[[:space:]]\+\(<div .*\)/\1\n\2/' input_file
"2000","Something","Something,"Something","Something","Something","2017-11-15 15:12:51","115060","Something","Something","Something","Something","","Something","Something","Something","Tabuk","TKPR","999","Something","Something","103984","Something","Something","UTC+03:00","sameday","15","100","3","1443","1","Something","3","Something","<div style=""margin:1em;"">"
<div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">"
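Using only your shown samples, please try the following awk code, written and tested in GNU awk. Simple explanation: set RS (the record separator) to the empty string, which puts awk in paragraph mode, and in the main program globally replace newlines followed by whitespace followed by <div with just <div, then print the line the awk-ish way by using 1.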
awk -v RS="" '{gsub(/\n+[[:space:]]+<div/,"<div")} 1' Input_file
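To make the paragraph-mode behaviour concrete, here is a small sketch on a made-up sample (the file contents and the names sample, awk_out and record_count are assumptions for illustration). It also counts the records, which shows that a file without blank lines is read as one single record, so the whole file is held in memory at once.

```shell
# Hypothetical sample with an indented continuation line and no blank lines.
sample=$(mktemp)
printf '%s\n' \
  '"1","a","<div style=""margin:1em;"">"' \
  '   <div lang=""en"">"' \
  '"2","b","plain"' > "$sample"

# The command from above, applied to the sample.
awk_out=$(awk -v RS="" '{gsub(/\n+[[:space:]]+<div/,"<div")} 1' "$sample")
printf '%s\n' "$awk_out"

# In paragraph mode, records are separated by blank lines; this sample has
# none, so awk sees exactly one record.
record_count=$(awk -v RS="" 'END { print NR }' "$sample")
echo "records: $record_count"

rm -f "$sample"
```

Note that the pattern requires at least one whitespace character between the line break and <div, so a continuation line that starts directly with <div would be left alone.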