How to remove duplicate lines from a huge text file in batch
I have a text file of more than 50 GB. It contains many lines, each about 15 characters long on average. I want every line to be unique (case-sensitive), so if a line is exactly the same as another line it must be removed, without changing the order of the remaining lines or sorting the file in any way.
My question is different from the others because the file is huge and cannot be handled by the other solutions I have found.
I have tried:
awk '!seen[$0]++' bigtextfile.txt > duplicatesremoved.txt
It starts out fine and fast, but soon I get the following error:
awk: (FILENAME=bigtextfile.txt FNR=19083509) fatal: more_nodes: nextfree: can't allocate 4000 bytes of memory (Not enough space)
The error above appears when the output file is about 200 MB.
Is there any other fast way to do the same thing on Windows?
You can do this on a UNIX box, or in Cygwin on top of Windows:
$ cat file
Speed, bonnie boat, like a bird on the wing,
Onward! the sailors cry;
Carry the lad that's born to be King
Over the sea to Skye.
Loud the winds howl, loud the waves roar,
Speed, bonnie boat, like a bird on the wing,
Thunderclaps rend the air;
Onward! the sailors cry;
Baffled, our foes stand by the shore,
Carry the lad that's born to be King
Follow they will not dare.
Over the sea to Skye.
$ cat -n file | sort -k2 -u | sort -n | cut -f2-
Speed, bonnie boat, like a bird on the wing,
Onward! the sailors cry;
Carry the lad that's born to be King
Over the sea to Skye.
Loud the winds howl, loud the waves roar,
Thunderclaps rend the air;
Baffled, our foes stand by the shore,
Follow they will not dare.
The only command above that tries to handle the whole file at once is sort, and sort is designed to handle huge files precisely by using paging etc. (see https://unix.stackexchange.com/q/279096/133219), so IMHO it's your best chance of being able to do this.
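If sort itself runs short of space on a 50 GB input, GNU sort (the version Cygwin ships) can also be told where to write its temporary files and how much memory to use before it spills to disk. The -T and -S flags below are standard GNU coreutils options, but the /cygdrive/d/tmp path and the 2G buffer size are only placeholders to adjust for your machine, so treat this as a sketch rather than the exact command you should run:
$ # -T: put sort's temp files on a drive with plenty of free space
$ # -S: set the in-memory buffer size before sort spills to temp files
$ cat -n file | sort -T /cygdrive/d/tmp -S 2G -k2 -u | sort -T /cygdrive/d/tmp -S 2G -n | cut -f2-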
Start with cat -n file and then add each command to the pipe one at a time to see what it's doing (see below). In short, it first adds line numbers so we can sort uniquely by content to get the unique lines, then sorts by the original line numbers to restore the original line order, and finally removes the line numbers added in the first step:
$ cat -n file
1 Speed, bonnie boat, like a bird on the wing,
2 Onward! the sailors cry;
3 Carry the lad that's born to be King
4 Over the sea to Skye.
5
6 Loud the winds howl, loud the waves roar,
7 Speed, bonnie boat, like a bird on the wing,
8 Thunderclaps rend the air;
9 Onward! the sailors cry;
10 Baffled, our foes stand by the shore,
11 Carry the lad that's born to be King
12 Follow they will not dare.
13 Over the sea to Skye.
14
$ cat -n file | sort -k2 -u
5
10 Baffled, our foes stand by the shore,
3 Carry the lad that's born to be King
12 Follow they will not dare.
6 Loud the winds howl, loud the waves roar,
2 Onward! the sailors cry;
4 Over the sea to Skye.
1 Speed, bonnie boat, like a bird on the wing,
8 Thunderclaps rend the air;
$ cat -n file | sort -k2 -u | sort -n
1 Speed, bonnie boat, like a bird on the wing,
2 Onward! the sailors cry;
3 Carry the lad that's born to be King
4 Over the sea to Skye.
5
6 Loud the winds howl, loud the waves roar,
8 Thunderclaps rend the air;
10 Baffled, our foes stand by the shore,
12 Follow they will not dare.
$ cat -n file | sort -k2 -u | sort -n | cut -f2-
Speed, bonnie boat, like a bird on the wing,
Onward! the sailors cry;
Carry the lad that's born to be King
Over the sea to Skye.
Loud the winds howl, loud the waves roar,
Thunderclaps rend the air;
Baffled, our foes stand by the shore,
Follow they will not dare.
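Putting it all together for the file from the question (assuming the input is named bigtextfile.txt as in the question; the output name is just an example), the whole job is a single pipeline:
$ cat -n bigtextfile.txt | sort -k2 -u | sort -n | cut -f2- > duplicatesremoved.txt
Note that cat -n puts a tab between the line number it adds and the original content; sort -k2 -u makes lines unique by comparing from the second field (the content) to the end of the line, and cut -f2- then uses that same tab to strip the numbers back off.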