Shuffling pairs of lines in two text files

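I'm working on a machine translation project in which I have 4.5 million lines of text in two languages, English and German. I would like to shuffle these lines prior to dividing the data into shards on which I will train my model. I know the shuf command described here shuffles the lines of a single file, but how can I ensure that the corresponding lines in the second file are shuffled into the same order? Is there a command that shuffles the lines of both files together?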

TL;DR

  • paste the two files together as separate columns of a single file
  • shuf the single file
  • cut the columns apart again

Paste

$ cat test.en 
a b c
d e f
g h i

$ cat test.de 
1 2 3
4 5 6
7 8 9

$ paste test.en test.de > test.en-de

$ cat test.en-de
a b c   1 2 3
d e f   4 5 6
g h i   7 8 9
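
One thing worth checking before pasting: paste joins the two lines with a tab by default, and cut later splits on that same tab. If a line in either input already contains a tab, the split would land in the wrong place. The following is a minimal sketch of a pre-check, assuming the sample files from above (they are recreated here so the snippet runs on its own); the variable names are only illustrative.

```shell
# Recreate the sample files so this snippet is self-contained
printf 'a b c\nd e f\ng h i\n' > test.en
printf '1 2 3\n4 5 6\n7 8 9\n' > test.de

# A literal tab character, built portably
tab=$(printf '\t')

# Count lines that already contain a tab (grep exits non-zero on no match, hence || true)
en_tabs=$(grep -c "$tab" test.en || true)
de_tabs=$(grep -c "$tab" test.de || true)

# Both files should also have the same number of lines
en_lines=$(wc -l < test.en | tr -d ' ')
de_lines=$(wc -l < test.de | tr -d ' ')

echo "tabs: en=$en_tabs de=$de_tabs  lines: en=$en_lines de=$de_lines"
```

If either tab count is non-zero, one option is to replace the tabs with spaces first, or to pick another delimiter that appears in neither file and pass it to both paste -d and cut -d.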

Shuffle

$ shuf test.en-de > test.en-de.shuf

$ cat test.en-de.shuf
d e f   4 5 6
a b c   1 2 3
g h i   7 8 9
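
If the shuffle needs to be repeatable (for example, to regenerate the same shards later), GNU shuf accepts a --random-source=FILE option. The sketch below assumes GNU coreutils; seed.bin, run1.shuf and run2.shuf are hypothetical file names, and the 4096-byte size is an arbitrary choice that is ample for this tiny sample. A file with millions of lines would likely need a considerably larger source.

```shell
# Recreate the pasted sample file (tab-separated) so this snippet is self-contained
printf 'a b c\t1 2 3\nd e f\t4 5 6\ng h i\t7 8 9\n' > test.en-de

# Save some random bytes once; this file acts as the seed (hypothetical name)
head -c 4096 /dev/urandom > seed.bin

# Two runs with the same random source
shuf --random-source=seed.bin test.en-de > run1.shuf
shuf --random-source=seed.bin test.en-de > run2.shuf

# cmp is silent and returns 0 when the files are identical
cmp run1.shuf run2.shuf && echo "same order in both runs"
```

Keeping seed.bin alongside the data is what makes the order reproducible; deleting it and generating a new one gives a fresh shuffle.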

Cut

$ cut -f1 test.en-de.shuf > test.en-de.shuf.en
$ cut -f2 test.en-de.shuf > test.en-de.shuf.de

$ cat test.en-de.shuf.en 
d e f
a b c
g h i

$ cat test.en-de.shuf.de
4 5 6
1 2 3
7 8 9
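
The three steps can also be chained and then checked. This is a sketch under the same assumptions as above (sample files recreated here, no tabs inside the lines); before.sorted and after.sorted are hypothetical names used only for the check. The idea is that re-pasting the shuffled outputs and sorting should give exactly the same set of pairs as pasting and sorting the originals.

```shell
# Recreate the sample files so this snippet is self-contained
printf 'a b c\nd e f\ng h i\n' > test.en
printf '1 2 3\n4 5 6\n7 8 9\n' > test.de

# paste and shuf in one pipeline, skipping the intermediate test.en-de file
paste test.en test.de | shuf > test.en-de.shuf

# Split the columns back out
cut -f1 test.en-de.shuf > test.en-de.shuf.en
cut -f2 test.en-de.shuf > test.en-de.shuf.de

# Sanity check: the set of (en, de) pairs must be unchanged by the shuffle
paste test.en test.de | sort > before.sorted
paste test.en-de.shuf.en test.en-de.shuf.de | sort > after.sorted
cmp before.sorted after.sorted && echo "pairs preserved"
```

On the full 4.5 million line files the sort-based check takes some time and disk space, so it may be enough to run it once after the first shuffle.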