Shuffling pairs of lines in two text files
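I am working on a machine translation project in which I have 4.5 million lines of text in two languages, English and German. I would like to shuffle these lines prior to dividing the data into shards on which I will train my model. I know the shuf command lets me shuffle the lines of a single file, but how can I make sure the corresponding lines in the second file are shuffled into the same order? Is there a command that shuffles the lines of both files together?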
TL;DR
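paste: combine the two files into a single file, one column per input file
shuf: shuffle the lines of that single file
cut: split the columns back into separate files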
Paste
$ cat test.en
a b c
d e f
g h i
$ cat test.de
1 2 3
4 5 6
7 8 9
$ paste test.en test.de > test.en-de
$ cat test.en-de
a b c 1 2 3
d e f 4 5 6
g h i 7 8 9
Shuffle
$ shuf test.en-de > test.en-de.shuf
$ cat test.en-de.shuf
d e f 4 5 6
a b c 1 2 3
g h i 7 8 9
Cut
$ cut -f1 test.en-de.shuf > test.en-de.shuf.en
$ cut -f2 test.en-de.shuf > test.en-de.shuf.de
$ cat test.en-de.shuf.en
d e f
a b c
g h i
$ cat test.en-de.shuf.de
4 5 6
1 2 3
7 8 9
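The three steps above can also be wrapped into one function. This is only a minimal sketch: the name shuffle_pair and its argument order are made up for illustration, and it assumes GNU shuf is installed. It adds two checks that the walkthrough takes for granted: paste joins the columns with a tab, so a tab inside either input would make cut -f1 and cut -f2 split in the wrong place, and the two files need the same number of lines to stay aligned.

```shell
# Hypothetical helper: shuffle two line-aligned files with the same permutation.
# Usage: shuffle_pair SRC_A SRC_B OUT_A OUT_B
shuffle_pair() {
    local a=$1 b=$2 out_a=$3 out_b=$4

    # paste separates columns with a tab, so a tab inside a line
    # would shift the columns that cut extracts later.
    if grep -q "$(printf '\t')" "$a" "$b"; then
        echo "input contains tab characters" >&2
        return 1
    fi

    # The files must have the same number of lines to stay paired.
    if [ "$(wc -l < "$a")" -ne "$(wc -l < "$b")" ]; then
        echo "line counts differ" >&2
        return 1
    fi

    local tmp
    tmp=$(mktemp)
    # Join the pairs, shuffle whole lines, then split the columns again.
    paste "$a" "$b" | shuf > "$tmp"
    cut -f1 "$tmp" > "$out_a"
    cut -f2 "$tmp" > "$out_b"
    rm -f "$tmp"
}
```

With the example files from above, it could be called as shuffle_pair test.en test.de test.en-de.shuf.en test.en-de.shuf.de. The order of the output is random on each run, but line N of one output file should still correspond to line N of the other.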