Replace single whitespaces with double
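I have a bunch of XML files which contain text (transcriptions of diaries). The requirement is for two spaces after the period at the end of a sentence. At present this is partly done, but not in all cases: sometimes there is only a single space after a period, before the first character of the next sentence.
I am using Git Bash for Windows and think that sed is the command to use, but I can't work out the correct regex. I think I need to find: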
period whitespace [some other character]
and replace it with
period whitespace whitespace [the same next character]
For example, at the moment we have this:
<p>The spacing after this sentence (two whitespaces) is what is required.  By contrast, this sentence has only a single space after the period. This is the next sentence, the last in a paragraph, which correctly has no whitespace at all after the period.</p>
What I need is this, with every period followed by two spaces, except the last one in a paragraph.
<p>The double whitespace after this sentence is what is required.  This sentence now also has a double space after the period.  This is the next sentence, the last in a paragraph, which correctly has no whitespace at all after the period.</p>
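Using sed, you could do this (\< and \b are GNU sed extensions that match at a word boundary):
sed -e "s/\. \</.  /g"
which changes the file as follows:
$ sed -e "s/\. \b/.  /g" test.txt > fixed.txt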
$ diff test.txt fixed.txt
1c1
< <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas vehicula placerat nisl, bibendum blandit tortor pharetra ut. Morbi nec tellus ultrices, porta felis et, dapibus diam. Phasellus vehicula ante ac urna elementum lacinia.</p>
---
> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.  Maecenas vehicula placerat nisl, bibendum blandit tortor pharetra ut.  Morbi nec tellus ultrices, porta felis et, dapibus diam.  Phasellus vehicula ante ac urna elementum lacinia.</p>
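To see what the word-boundary version does and does not touch, here is a small sketch. The function name double_space is made up for illustration, and it assumes GNU sed (which, as far as I know, is what Git Bash ships), because \< is not part of POSIX sed.

```shell
# double_space: hypothetical helper name; filters stdin to stdout.
double_space() {
  # "\. \<" = a period, one space, then the start of a word.
  # A period already followed by two spaces is skipped, because the
  # position after the first space is not the start of a word.
  sed -e 's/\. \</.  /g'
}

printf '%s\n' 'One.  Two. Three.' | double_space
# prints: One.  Two.  Three.

# Caveat: a sentence that starts with a non-word character (a quote,
# a bracket, a tag) is not at a "start of word", so it is left alone.
printf '%s\n' 'He left. "Why?" she asked.' | double_space
# prints: He left. "Why?" she asked.
```

If your diaries contain sentences that open with quotation marks, the variants that match "any non-space character" are probably a better fit than the word-boundary one.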
sed is a bit limited (can you use grep or perl?). Anyway, you can use a regex like this (GNU sed specific):
sed -i -r 's/\. ([^ ])/.  \1/g' <file>
Legend:
-i # sed switch: edit the file passed as a parameter in place
-r # use extended regex
/\. ([^ ]) # match a single dot followed by a space and by a non-space character, which is captured
/.  \1/ # replace with a dot followed by 2 spaces and by the captured non-space character
g # apply to every match on the line
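The regex can be refined if more test cases come up.
As @ghoti pointed out, the answer above is GNU sed specific. I think a more portable approach (without extended regex or in-place editing) could be:
sed 's/\. \([^ ]\)/.  \1/g' <input.file> > <output.file>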
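Since the question mentions a whole bunch of XML files, the portable form can be wrapped in a loop. This is only a sketch: fix_dir is a made-up name, the ".tmp" suffix is an arbitrary choice, and it assumes all the files sit directly in one directory.

```shell
# fix_dir: hypothetical helper; rewrites every *.xml file directly
# inside the directory given as $1, using only basic-regex sed.
fix_dir() {
  for f in "$1"/*.xml; do
    [ -e "$f" ] || continue          # nothing matched the glob
    tmp="$f.tmp"                     # scratch copy next to the original
    # Write to the scratch file first; replace the original only if
    # sed succeeded, so a failure does not leave a truncated file.
    sed 's/\. \([^ ]\)/.  \1/g' "$f" > "$tmp" && mv "$tmp" "$f"
  done
}

# Usage (the directory name is an assumption):
#   fix_dir ./transcripts
```

It would be wise to run this on a copy of the files, or inside a Git working tree, so that the result can be checked with diff before keeping it.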
You can use perl:
perl -pe 's-\. (?! )-.  -g' test
Example:
$ cat test
This is. A simple. Test to check. That it works!
$ perl -pe 's-\. (?! )-.  -g' test
This is.  A simple.  Test to check.  That it works!
The regex \. (?! ) matches a period followed by a space that is not followed by another space.
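You want to find all the whitespace that appears after a dot and remember the next character. Then replace it with a dot, two spaces, and whatever the remembered character was. The remembering part is called a "tagged expression".
So, search for \. +([^ ]), which means "dot, some spaces, [tagged expression]something that isn't a space[end tagged expression]".
Replace with .  \1 (a dot, two spaces, and the tagged expression).
Here's a sed example:
$ echo '>zzz. xxx. yyy.<' | sed -r -e 's/\. +([^ ])/.  \1/g'
>zzz.  xxx.  yyy.<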
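Note that because of the +, this version also collapses three or more spaces down to two, which the single-space patterns do not do. A small sketch follows; normalize_spacing is a made-up name, and -E is used in place of -r (GNU sed accepts both spellings for extended regex).

```shell
# normalize_spacing: hypothetical helper name; filters stdin to stdout.
normalize_spacing() {
  # " +" matches one or more spaces after the period, so one, two or
  # five spaces all end up as exactly two. \1 puts back the captured
  # non-space character that follows them.
  sed -E 's/\. +([^ ])/.  \1/g'
}

printf '%s\n' '>aaa. bbb.   ccc.  ddd.<' | normalize_spacing
# prints: >aaa.  bbb.  ccc.  ddd.<
```

The final period is untouched because no space follows it, which matches the requirement for the last sentence of a paragraph.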