How to remove duplicate lines from a huge text file in batch
I have a text file of more than 50 GB. It contains many lines, each about 15 characters long on average. I want every line to be unique (case-sensitive), so if a line is exactly the same as another line it must be removed, without changing the order of the remaining lines or sorting the file in any way.
My question is different from the others because the file is huge and cannot be handled by the other solutions I have found.
I have tried:
awk '!seen[$0]++' bigtextfile.txt > duplicatesremoved.txt
It starts out fine and fast, but soon I get the following error:
awk: (FILENAME=bigtextfile.txt FNR=19083509) fatal: more_nodes: nextfree: can't allocate 4000 bytes of memory (Not enough space)
The error above appears when the output file is about 200 MB.
Is there any other fast way to do the same thing on Windows?
You can do this on a UNIX box, or in Cygwin on top of Windows:
$ cat file
Speed, bonnie boat, like a bird on the wing,
Onward! the sailors cry;
Carry the lad that's born to be King
Over the sea to Skye.
Loud the winds howl, loud the waves roar,
Speed, bonnie boat, like a bird on the wing,
Thunderclaps rend the air;
Onward! the sailors cry;
Baffled, our foes stand by the shore,
Carry the lad that's born to be King
Follow they will not dare.
Over the sea to Skye.
$ cat -n file | sort -k2 -u | sort -n | cut -f2-
Speed, bonnie boat, like a bird on the wing,
Onward! the sailors cry;
Carry the lad that's born to be King
Over the sea to Skye.
Loud the winds howl, loud the waves roar,
Thunderclaps rend the air;
Baffled, our foes stand by the shore,
Follow they will not dare.
The only command above that tries to handle the whole file at once is sort, and sort is designed to handle huge files precisely by using paging etc. (see https://unix.stackexchange.com/q/279096/133219), so IMHO it's your best chance of being able to do this.
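If sort itself runs short of space on a 50 GB input, GNU sort (the version Cygwin ships) can also be told where to write its temporary files and how much memory to use before it spills to disk. The -T and -S flags below are standard GNU coreutils options, but the /cygdrive/d/tmp path and the 2G buffer size are only placeholders to adjust for your machine, so treat this as a sketch rather than the exact command you should run:
$ # -T: put sort's temp files on a drive with plenty of free space
$ # -S: set the in-memory buffer size before sort spills to temp files
$ cat -n file | sort -T /cygdrive/d/tmp -S 2G -k2 -u | sort -T /cygdrive/d/tmp -S 2G -n | cut -f2-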
Start with cat -n file and then add each command to the pipe one at a time to see what it's doing (see below). In short, it first adds line numbers so we can sort uniquely by content to get the unique lines, then sorts by the original line numbers to restore the original line order, and finally removes the line numbers added in the first step:
$ cat -n file
1 Speed, bonnie boat, like a bird on the wing,
2 Onward! the sailors cry;
3 Carry the lad that's born to be King
4 Over the sea to Skye.
5
6 Loud the winds howl, loud the waves roar,
7 Speed, bonnie boat, like a bird on the wing,
8 Thunderclaps rend the air;
9 Onward! the sailors cry;
10 Baffled, our foes stand by the shore,
11 Carry the lad that's born to be King
12 Follow they will not dare.
13 Over the sea to Skye.
14
$ cat -n file | sort -k2 -u
5
10 Baffled, our foes stand by the shore,
3 Carry the lad that's born to be King
12 Follow they will not dare.
6 Loud the winds howl, loud the waves roar,
2 Onward! the sailors cry;
4 Over the sea to Skye.
1 Speed, bonnie boat, like a bird on the wing,
8 Thunderclaps rend the air;
$ cat -n file | sort -k2 -u | sort -n
1 Speed, bonnie boat, like a bird on the wing,
2 Onward! the sailors cry;
3 Carry the lad that's born to be King
4 Over the sea to Skye.
5
6 Loud the winds howl, loud the waves roar,
8 Thunderclaps rend the air;
10 Baffled, our foes stand by the shore,
12 Follow they will not dare.
$ cat -n file | sort -k2 -u | sort -n | cut -f2-
Speed, bonnie boat, like a bird on the wing,
Onward! the sailors cry;
Carry the lad that's born to be King
Over the sea to Skye.
Loud the winds howl, loud the waves roar,
Thunderclaps rend the air;
Baffled, our foes stand by the shore,
Follow they will not dare.
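Putting it all together for the file from the question (assuming the input is named bigtextfile.txt as in the question; the output name is just an example), the whole job is a single pipeline:
$ cat -n bigtextfile.txt | sort -k2 -u | sort -n | cut -f2- > duplicatesremoved.txt
Note that cat -n puts a tab between the line number it adds and the original content; sort -k2 -u makes lines unique by comparing from the second field (the content) to the end of the line, and cut -f2- then uses that same tab to strip the numbers back off.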