How to sort a jsonl file by value with the lowest RAM consumption?
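I have a very large jsonl file (several million lines).
I want to sort this file by a given value, but I don't want to load it completely into RAM.
Do you have a solution to suggest?
I looked at jq with sort_by, but as far as I can tell the file is not streamed that way.
Additional notes:
- The order within a group doesn't matter
- If the approach requires splitting the file, having as many output files as there are usernames is also fine with me.
Example:
Here is a dummy example of my input file: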
{"username": "user1", "email": "email1", "value": "10"}
{"username": "user2", "email": "email2", "value": "30"}
{"username": "user2", "email": "email2", "value": "30"}
{"username": "user1", "email": "email1", "value": "5"}
{"username": "user3", "email": "email3", "value": "15"}
{"username": "user1", "email": "email1", "value": "40"}
{"username": "user3", "email": "email1", "value": "40"}
Here is the output I want:
{"username": "user1", "email": "email1", "value": "10"}
{"username": "user1", "email": "email1", "value": "5"}
{"username": "user1", "email": "email1", "value": "40"}
{"username": "user2", "email": "email2", "value": "30"}
{"username": "user2", "email": "email2", "value": "30"}
{"username": "user3", "email": "email3", "value": "15"}
{"username": "user3", "email": "email1", "value": "40"}
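One approach is to convert the documents into lines that can be sorted by a tool that works within a limited amount of memory, such as the sort Unix command-line utility.
You could use the following: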
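# Prefix each document with its sort key, separated by a NUL character.
jq -r '"\( .username )\u0000\( tojson )"' a.json |
# Sort the prefixed lines.
sort |
# Drop the prefix and parse the JSON document again.
jq -Rc '. / "\u0000" | .[-1] | fromjson'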
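For the provided input, the code above produces the following output (sort compares whole lines, so documents with the same username end up ordered by their JSON text, which is fine because the order within a group doesn't matter):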
{"username":"user1","email":"email1","value":"10"}
{"username":"user1","email":"email1","value":"40"}
{"username":"user1","email":"email1","value":"5"}
{"username":"user2","email":"email2","value":"30"}
{"username":"user2","email":"email2","value":"30"}
{"username":"user3","email":"email1","value":"40"}
{"username":"user3","email":"email3","value":"15"}
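If jq and sort are not at hand, the same idea can be sketched in Python as an external merge sort: sort fixed-size chunks in memory, write each one to a temporary file, then merge the files. This is only a rough illustration of the technique, not a description of how sort is implemented. The function name sort_jsonl and the chunk_lines parameter are made up for this sketch, and it assumes every line is a JSON object that contains the key.

```python
import heapq
import json
import os
import tempfile
from itertools import islice


def sort_jsonl(src_path, dst_path, key="username", chunk_lines=100_000):
    """Sort a JSON Lines file by `key`, reading at most `chunk_lines` lines at a time."""

    def sort_key(line):
        return json.loads(line)[key]

    run_paths = []
    try:
        # Phase 1: cut the input into sorted runs stored in temporary files.
        with open(src_path, encoding="utf-8") as src:
            while True:
                lines = list(islice(src, chunk_lines))
                if not lines:
                    break
                # Skip blank lines and make sure every record ends with "\n".
                records = [l.rstrip("\n") + "\n" for l in lines if l.strip()]
                records.sort(key=sort_key)
                fd, path = tempfile.mkstemp(suffix=".jsonl")
                with os.fdopen(fd, "w", encoding="utf-8") as run:
                    run.writelines(records)
                run_paths.append(path)

        # Phase 2: k-way merge of the runs; roughly one line per run is held at once.
        runs = [open(p, encoding="utf-8") for p in run_paths]
        try:
            with open(dst_path, "w", encoding="utf-8") as dst:
                dst.writelines(heapq.merge(*runs, key=sort_key))
        finally:
            for run in runs:
                run.close()
    finally:
        # Remove the temporary run files whether or not the merge succeeded.
        for path in run_paths:
            os.remove(path)
```

A call such as sort_jsonl("a.json", "sorted.jsonl") would group the sample input by username. Memory use is governed by chunk_lines, so a smaller value trades RAM for more temporary files.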
Along the same lines, you could produce TSV (jq -r '"\( .username )\t\( tojson )"') or load the documents into a database. A simple SQL query would then extract the JSON documents in sorted order.
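As a rough sketch of the database route, the following uses Python's built-in sqlite3 module with an on-disk database, so the documents live on disk and not in RAM. The function name, the docs table, and its column names are invented for this example. It assumes db_path points to a fresh scratch file and that every line is a JSON object containing the key.

```python
import json
import sqlite3


def sort_jsonl_with_sqlite(src_path, dst_path, db_path, key="username"):
    """Load JSON Lines into an on-disk SQLite table, then read them back ordered by `key`."""
    con = sqlite3.connect(db_path)
    try:
        con.execute("CREATE TABLE docs (k TEXT, doc TEXT)")
        with open(src_path, encoding="utf-8") as src:
            # Generator: lines are streamed into the table one at a time.
            rows = (
                (json.loads(line)[key], line.rstrip("\n"))
                for line in src
                if line.strip()
            )
            con.executemany("INSERT INTO docs VALUES (?, ?)", rows)
        con.commit()
        # Index the sort key so that ORDER BY can use it.
        con.execute("CREATE INDEX docs_k ON docs (k)")
        with open(dst_path, "w", encoding="utf-8") as dst:
            # Iterating the cursor fetches rows incrementally.
            for (doc,) in con.execute("SELECT doc FROM docs ORDER BY k"):
                dst.write(doc + "\n")
    finally:
        con.close()
```

For example, sort_jsonl_with_sqlite("a.json", "sorted.jsonl", "scratch.db") writes the documents grouped by username. The scratch database can be deleted afterwards, or kept if further queries are useful.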