Notepad++ Regex to find string on a line and remove duplicates of the exact string

Question

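Does anyone know how to match an arbitrary string and then remove an immediate repeat of that same string on each line of a file?

Basically, I have a file: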

00101  blah 0000202 thisisasentencethisisasentence 99929
00102  blah 0000202 thisisasentenc1thisisasentenc1 999292

I want to remove the duplicated sentence so that it returns:

00101  blah 0000202 thisisasentence 99929
00102  blah 0000202 thisisasentenc1 999292

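The columns are not fixed width or anything like that.

I thought this related question was close, but I don't know regex well: its pattern highlighted everything in the file except the last line. It found the duplicate correctly, but only once. "Removing duplicate strings/words(not lines) using RegEx(notepad++)"

Note that I can also use the following to identify which part of each line is duplicated. It highlights the duplicated value (thisisasentencethisisasentence), but I don't know how to split it:

(.{5,})\1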

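Any help would be appreciated, thanks.

Edit: I can reformat the data to make it comma-separated (to some degree). Note that the duplicated string may itself contain commas, but don't worry about that: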

00101,blah,0000202,thisisasentencethisisasentence,99929
00102,blah,0000202,thisisasentenc1thisisasentenc1,999292

Answer 1

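You said you are happy with what (.{5,})\1 matches. So just use \1 as the replacement value. That replaces the repeated part together with its copy by a single copy of the text.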
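Outside Notepad++, the same find-and-replace can be sketched in Python. This is only an illustration and assumes Python's `re` module treats `(.{5,})\1` the same way for this data; the function name `collapse_repeats` is made up for the example.

```python
import re

# A run of at least five characters immediately followed by an exact copy of itself.
DUPLICATE = re.compile(r'(.{5,})\1')

def collapse_repeats(line):
    # Replace "XX" with "X", where X is whatever group 1 captured.
    return DUPLICATE.sub(r'\1', line)

print(collapse_repeats("00101  blah 0000202 thisisasentencethisisasentence 99929"))
```

Because the pattern is not tied to column boundaries, it would also collapse any other adjacent repeat of five or more characters that happens to appear on a line.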

Answer 2

You can use this pattern in Notepad++ with an empty string as the replacement:

^(?>\S+[^\S\n]+){3,}?(\S+?)\K\1(?!\S)

Pattern details:

^                  # anchor for the start of the line (the default in Notepad++)
(?>                # atomic group: a column and the following whitespace
    \S+            # one or more non-whitespace characters
    [^\S\n]+       # whitespace other than newlines
){3,}?             # repeat 3 or more times (non-greedy) until you find
(\S+?)\K\1(?!\S)   # a column with a duplicate

Details of the last subpattern:

(\S+?)   # capture one or more non-whitespace characters
         # (non-greedy: until (?!\S) succeeds)
\K       # discard all previous characters from the whole match result
\1       # back-reference to capture group 1
(?!\S)   # ensure that the end of the column is reached

Note: using {5,} instead of + in \S+? (so \S{5,}?) is a good idea if you are sure the column contains at least five characters.
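For readers who want to try the column-aware idea outside Notepad++, here is a rough Python adaptation. It is a sketch, not the pattern above verbatim: Python's `re` module has no `\K`, so the leading columns are captured in an extra group and written back in the replacement, and a plain non-capturing group stands in for the atomic group. The function name `dedupe_column` and its `min_len` parameter are hypothetical.

```python
import re

def dedupe_column(text, min_len=1):
    # Group 1: three or more leading columns with their trailing blanks
    #          (this plays the role of the part that \K discards).
    # Group 2: the shortest run of at least min_len non-whitespace characters
    #          that is followed by a copy of itself and then the column end.
    pattern = re.compile(
        r'^((?:\S+[^\S\n]+){3,}?)(\S{%d,}?)\2(?!\S)' % min_len,
        re.MULTILINE,  # make ^ match at the start of every line
    )
    # Keep the leading columns and a single copy of the repeated text.
    return pattern.sub(r'\1\2', text)

sample = (
    "00101  blah 0000202 thisisasentencethisisasentence 99929\n"
    "00102  blah 0000202 thisisasentenc1thisisasentenc1 999292"
)
print(dedupe_column(sample, min_len=5))
```

Passing `min_len=5` corresponds to the \S{5,}? variant from the note; the default of 1 corresponds to \S+?.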

regex

notepad++

duplicates