How to sort a jsonl file by value with the lowest RAM consumption?

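I have a very large jsonl file (several million lines).
I want to sort this file by a given value, but I don't want to load it entirely into RAM.
Do you have a solution to suggest?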

I looked at jq with its sort_by option, but I don't think it streams the file.

Additional details:

Example:

Here is a dummy example of my input file:

{"username": "user1", "email": "email1", "value": "10"}
{"username": "user2", "email": "email2", "value": "30"}
{"username": "user2", "email": "email2", "value": "30"}
{"username": "user1", "email": "email1", "value": "5"}
{"username": "user3", "email": "email3", "value": "15"}
{"username": "user1", "email": "email1", "value": "40"}
{"username": "user3", "email": "email1", "value": "40"}

Here is the output I want:

{"username": "user1", "email": "email1", "value": "10"}
{"username": "user1", "email": "email1", "value": "5"}
{"username": "user1", "email": "email1", "value": "40"}
{"username": "user2", "email": "email2", "value": "30"}
{"username": "user2", "email": "email2", "value": "30"}
{"username": "user3", "email": "email3", "value": "15"}
{"username": "user3", "email": "email1", "value": "40"}

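One approach is to convert the documents into lines that can be sorted by a tool designed to work with limited memory, such as the Unix sort command-line utility.

You can use something like the following: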

jq -r '"\( .username )\u0000\( tojson )"' a.json |
sort |
jq -Rc '. / "\u0000" | .[-1] | fromjson'

For the input provided, the code above produces the following output:

{"username":"user1","email":"email1","value":"10"}
{"username":"user1","email":"email1","value":"40"}
{"username":"user1","email":"email1","value":"5"}
{"username":"user2","email":"email2","value":"30"}
{"username":"user2","email":"email2","value":"30"}
{"username":"user3","email":"email1","value":"40"}
{"username":"user3","email":"email3","value":"15"}

Along the same lines, you could produce TSV (jq -r '"\( .username )\t\( tojson )"') or inject the documents into a database. A simple SQL query would then extract the sorted JSON documents.
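
To make the database route concrete, here is a minimal sketch using SQLite through Python's built-in sqlite3 module. The choice of SQLite, the function name sort_jsonl_with_sqlite, the table and index names, and the default db_path are all assumptions for illustration; any database would do. It assumes the sort key is a scalar (string or number) present in every document.

```python
import json
import sqlite3


def sort_jsonl_with_sqlite(src_path, dst_path, key="username",
                           db_path="sort_tmp.db"):
    """Load JSONL lines into an on-disk SQLite table, then read them back sorted."""
    con = sqlite3.connect(db_path)
    try:
        con.execute("DROP TABLE IF EXISTS docs")
        # sort_key is left untyped so numbers stay numbers and strings stay strings.
        con.execute("CREATE TABLE docs (sort_key, doc TEXT)")
        with open(src_path, encoding="utf-8") as src:
            # A generator feeds executemany, so the file is read line by line.
            rows = ((json.loads(line)[key], line.strip())
                    for line in src if line.strip())
            con.executemany("INSERT INTO docs VALUES (?, ?)", rows)
        # An index lets the query below read rows in key order.
        con.execute("CREATE INDEX docs_sort_key ON docs (sort_key)")
        con.commit()
        with open(dst_path, "w", encoding="utf-8") as dst:
            # rowid as a tie-breaker keeps the input order for equal keys.
            query = "SELECT doc FROM docs ORDER BY sort_key, rowid"
            for (doc,) in con.execute(query):
                dst.write(doc + "\n")
    finally:
        con.close()
```

The database file is left on disk afterwards, which is handy if you want to run further queries against it; delete it once you are done.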