AWK - Processing a large CSV (13 Billion Rows) for duplicate data based on multiple columns (composite key) results in Out of Memory error
I have a CSV file containing 13 billion rows, weighing in at 719GB. There are some duplicate rows in the CSV. The CSV contains three columns; sample data follows:
tag,time,sensor_value
"CHLR_3_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-08-31 10:27:13.736248","59.76000213623047"
Each instance of the composite key of tag and time should be unique. In other words, a tag can have only one value at a given time.
I tried the following:
awk -F, '!seen[$1,$2]++' data.csv > data_UNIQUE.csv
The kernel eventually killed the above process with an Out Of Memory error. My system specs are as follows:
Intel(R) Core(TM) i9-10900K CPU @ 3.70GHz
128GB RAM
2TB NVME
How do I successfully process this CSV using awk?
Edit:
The desired output CSV will contain no duplicate data, and per the discussion in the comments it makes sense to sort before passing the data to awk, so that we only ever have to look at adjacent rows.
Desired output:
tag,time,sensor_value
"CHLR_1_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-08-31 10:27:13.736248","59.76000213623047"
Sort based on the multi-column key. A single pass over the sorted file can then eliminate the duplicates by looking only at adjacent records. There is no need to hold the whole file in memory.
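To see why the seen[] approach dies: it stores one array element per unique key, and the keys here are roughly 40 characters each ("CHLR_3_SP" plus a quoted 26-character timestamp), so 13 billion of them need upwards of 500GB for the key strings alone, before awk's per-element overhead, which is far more than 128GB of RAM. sort, by contrast, spills to temporary files on disk. As a minimal sketch of the sort-first idea (assuming GNU sort; the -S buffer size, -T temp directory, and /mnt/tmp path are illustrative, and the full scripts below handle the header and ordering more robustly):

{
  head -n 1 data.csv                        # pass the header through untouched
  tail -n +2 data.csv |
    sort -t, -k1,2 -u -S 16G -T /mnt/tmp    # sort on tag+time; -u keeps one row per key
} > data_UNIQUE.csv

Note that this opens data.csv twice, so unlike the scripts below it cannot read from a pipe.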
Edit: comparing adjacent records and ignoring duplicates (warning: untested code):
($1 $2) == prev { next }
{ prev = ($1 $2); print $0 }
The first line compares the concatenation of the first two fields to that of the previous record, and skips to the next record if the key fields of the current and previous records match. The second line executes only when the current record differs from the previous one; it saves the key from the first two fields and then prints the record.
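As a usage sketch (equally untested): with the two lines above saved in a hypothetical dedup.awk, the whole job becomes a sort piped into a single streaming pass:

sort -t, -k1,2 -T /mnt/tmp data.csv | awk -F, -f dedup.awk > data_UNIQUE.csv

One caveat: the header line will be sorted out of place (the quoted data rows sort before it), which the decorated scripts below avoid.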
Using any versions of the mandatory Unix tools awk, sort, and cut, this will produce output sorted by the 2 key values:
$ cat tst.sh
#!/usr/bin/env bash
awk '
BEGIN { FS=OFS="," }
{ print (NR>1), NR, $0 }
' "${@:--}" |
sort -t, -k1,1n -k3,4 -k2,2n |
cut -d, -f3- |
awk '
BEGIN { FS=OFS="," }
{ key = $1 FS $2 }
key != prev {
print
prev = key
}
'
$ ./tst.sh file
tag,time,sensor_value
"CHLR_1_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-08-31 10:27:13.736248","59.76000213623047"
And this one retains the input ordering in the output:
$ cat tst.sh
#!/usr/bin/env bash
awk '
BEGIN { FS=OFS="," }
{ print (NR>1), NR, $0 }
' "${@:--}" |
sort -t, -k3,4 |
awk '
BEGIN { FS=OFS="," }
{ key = $3 FS $4 }
key != prev {
print
prev = key
}
' |
sort -t, -k1,1n -k2,2n |
cut -d, -f3-
$ ./tst.sh file
tag,time,sensor_value
"CHLR_3_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_3_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_1_SP","2020-08-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-03-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-01-31 10:27:13.736248","59.76000213623047"
"CHLR_2_SP","2020-08-31 10:27:13.736248","59.76000213623047"
We decorate the input using awk (printing NR>1) to separate the header line (0) from the rest of the lines (1), rather than using head -n 1 test.csv && tail -n +2 test.csv | sort..., since the latter requires the input file to be opened twice and so would not work if the input were coming from a pipe. We also decorate with NR so that, given 2 duplicate keys, the value printed is the first one present in the input (or we could reverse the sort on that field so the last one gets printed, if that were preferable). We could have used GNU sort's -s (stable sort) for that instead, but then the solution would become unnecessarily GNU-only.
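For completeness, a sketch of that GNU-only alternative to the first (sorted-output) script, relying on a stable sort instead of the NR decoration (assumes GNU sort's -s option; untested):

awk 'BEGIN{FS=OFS=","} {print (NR>1), $0}' file |
sort -s -t, -k1,1n -k2,3 |
cut -d, -f2- |
awk 'BEGIN{FS=","} {key=$1 FS $2} key!=prev {print; prev=key}'

Because -s preserves the input order among records with equal keys, the first occurrence of each key still sorts first without carrying NR through the pipeline.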