Using jq to break an array of elements into multiple arrays of varying length
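I have a JSON object that I would like to transform from one form into another using jq (of course I could use javascript or python and iterate, but jq would be preferable). The problem is that the input contains long arrays that need to be split into several smaller arrays whenever the data in the leading elements stops repeating. I'm not quite sure how to describe this problem, so I'll give an example here in the hope that it is more explanatory. One safe assumption, if it is any help, is that the input data is always pre-sorted on the first two elements (e.g. "row_x" and "col_y"):
Input: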
{
"headers": [ "col1", "col2", "col3" ],
"data": [
[ "row1","col1","b","src2" ],
[ "row1","col1","b","src1" ],
[ "row1","col1","b","src3" ],
[ "row1","col2","d","src4" ],
[ "row1","col2","e","src5" ],
[ "row1","col2","f","src6" ],
[ "row1","col3","j","src7" ],
[ "row1","col3","g","src8" ],
[ "row1","col3","h","src9" ],
[ "row1","col3","i","src10" ],
[ "row2","col1","l","src13" ],
[ "row2","col1","j","src11" ],
[ "row2","col1","k","src12" ],
[ "row2","col3","o","src15" ]
]
}
Desired output:
{
"headers": [ "col1", "col2", "col3" ],
"values": [
[["b","b","b"],["d","e","f"],["g","h","i","j"]],
[["j","k","l"],null,["o"]]
],
"sources": [
[["src1","src2","src3"],["src4","src5","src6"],["src7","src8","src9","src10"]],
[["src11","src12","src13"],null,["src15"]]
]
}
Is this doable in jq at all?
UPDATE: A variant of this is to preserve the original data order, so that the output looks as follows:
{
"headers": [ "col1", "col2", "col3" ],
"values": [
[["b","b","b"],["d","e","f"],["j","g","h","i"]],
[["l","j","k"],null,["o"]]
],
"sources": [
[["src2","src1","src3"],["src4","src5","src6"],["src7","src8","src9","src10"]],
[["src13","src11","src12"],null,["src15"]]
]
}
Doable? Certainly!
First, you need to group the data by row and then by column. Then build your values/sources arrays from those groups.
.headers as $headers | .data
# make the data easier to access
| map({ row: .[0], col: .[1], val: .[2], src: .[3] })
# keep it sorted so they are in expected order in the end
| sort_by([.row,.col,.src])
# group by rows
| group_by(.row)
# create a map to each of the cols for easier access
| map(group_by(.col)
    | reduce .[] as $col ({};
        .[$col[0].col] = [$col[] | {val,src}]
      )
  )
# build the result
| {
    headers: $headers,
    values: map([.[$headers[]] | [.[]?.val]]),
    sources: map([.[$headers[]] | [.[]?.src]])
  }
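This yields the following result. Note that sort_by compares the src strings lexicographically, so "src10" sorts before "src7", and that a column with no data comes out as [] rather than null: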
{
"headers": [ "col1", "col2", "col3" ],
"values": [
[
[ "b", "b", "b" ],
[ "d", "e", "f" ],
[ "i", "j", "g", "h" ]
],
[
[ "j", "k", "l" ],
[],
[ "o" ]
]
],
"sources": [
[
[ "src1", "src2", "src3" ],
[ "src4", "src5", "src6" ],
[ "src10", "src7", "src8", "src9" ]
],
[
[ "src11", "src12", "src13" ],
[],
[ "src15" ]
]
]
}
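For cross-checking the filter above, here is a minimal Python sketch of the same group-by-row-then-by-column idea (the question itself mentions Python as the fallback). The function name pivot_grouped is hypothetical, and the sketch assumes every entry in "data" has exactly four elements. It deliberately mirrors the jq behaviour shown above: strings are compared lexicographically and a missing column becomes an empty list.

```python
from itertools import groupby


def pivot_grouped(doc):
    """Group doc["data"] by row key, then by column key (hypothetical helper)."""
    headers = doc["headers"]
    # Sort on (row, col, src) as plain strings, so "src10" sorts before "src7".
    records = sorted(doc["data"], key=lambda r: (r[0], r[1], r[3]))
    values, sources = [], []
    for _, row_items in groupby(records, key=lambda r: r[0]):
        by_col = {}
        for _row, col, val, src in row_items:
            by_col.setdefault(col, []).append((val, src))
        # One list per header; a column without data yields an empty list.
        values.append([[v for v, _ in by_col.get(h, [])] for h in headers])
        sources.append([[s for _, s in by_col.get(h, [])] for h in headers])
    return {"headers": headers, "values": values, "sources": sources}
```

Calling pivot_grouped on the parsed input document returns a dict with the same three keys as the jq result.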
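Since the primary data source here can be regarded as a two-dimensional matrix, it may be worth considering a matrix-oriented approach to the problem, especially if it is intended that empty rows in the input matrix should not simply be omitted, or if the number of columns in the matrix is not known in advance.
To make things a little more interesting, let's choose to represent an m x n matrix M as a JSON array of the form [m, n, a], where a is an array of arrays such that a[i][j] is the element in row i, column j of M.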
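First, let's define some basic matrix-oriented operations:
# element at row i, column j of an [m, n, a] matrix
def ij(i;j): .[2][i][j];
# set element (i, j), growing the recorded dimensions m and n if necessary
def set_ij(i;j;value):
  def max(a;b): if a < b then b else a end;
  .[0] as $m | .[1] as $n
  | [max(i+1;$m), max(j+1;$n), (.[2] | setpath([i,j];value)) ];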
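To make the [m, n, a] representation concrete, the two operations can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names simply reuse the jq names, None stands in for JSON null, and missing rows and cells are padded with None in the same way that setpath pads with null.

```python
def ij(matrix, i, j):
    """Return cell (i, j) of an [m, n, a] triple, or None when it is absent."""
    rows = matrix[2] or []
    if i < len(rows) and rows[i] is not None and j < len(rows[i]):
        return rows[i][j]
    return None


def set_ij(matrix, i, j, value):
    """Return a new [m, n, a] triple with cell (i, j) set, growing m and n."""
    m, n, a = matrix
    # Copy the rows so that the input triple is left untouched.
    rows = [None if r is None else list(r) for r in (a or [])]
    rows.extend([None] * (i + 1 - len(rows)))  # pad missing rows with None
    row = rows[i] if rows[i] is not None else []
    row.extend([None] * (j + 1 - len(row)))    # pad missing cells with None
    row[j] = value
    rows[i] = row
    return [max(i + 1, m), max(j + 1, n), rows]
```

Starting from the empty matrix [0, 0, None], setting cell (1, 2) gives a 2 x 3 matrix whose first row is still None.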
The data source uses strings of the form "rowI" for the i-th row and "colJ" for the j-th column, so we define a matrix-update function accordingly:
def update_row_col( row; col; value):
  ((row|sub("^row";"")|tonumber) - 1) as $r
  | ((col|sub("^col";"")|tonumber) - 1) as $c
  | ij($r;$c) as $v
  | set_ij($r; $c; if $v == null then [value] else $v + [value] end) ;
Given an array of items of the form ["rowI","colJ", V, S], generate a matrix in which the cell at row I and column J holds an array of the objects {"source": S, "value": V}:
def generate:
  reduce .[] as $x ([0,0,null];
    update_row_col( $x[0]; $x[1]; { "source": $x[3], "value": $x[2] }) );
Now we turn to the desired output. The following filter extracts f from the input matrix, producing an array of arrays and replacing [] with null:
def extract(f):
  . as $m
  | (reduce range(0; $m[0]) as $i
      ([];
       . + ( reduce range(0; $m[1]) as $j
               ([];
                . + [ $m | ij($i;$j) // [] | map(f) ]) ) ))
  | map( if length == 0 then null else . end );
Putting it all together (generating the headers dynamically is left as an exercise for the interested reader):
{headers} +
(.data | generate
 | { "values": extract(.value), "sources": extract(.source) } )
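Output (the cells are listed row by row in a single flat array, one entry per cell, with the original data order preserved):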
{
"headers": [
"col1",
"col2",
"col3"
],
"values": [
[
"b",
"b",
"b"
],
[
"d",
"e",
"f"
],
[
"j",
"g",
"h",
"i"
],
[
"l",
"j",
"k"
],
null,
[
"o"
]
],
"sources": [
[
"src2",
"src1",
"src3"
],
[
"src4",
"src5",
"src6"
],
[
"src7",
"src8",
"src9",
"src10"
],
[
"src13",
"src11",
"src12"
],
null,
[
"src15"
]
]
}
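The matrix-oriented idea can also be sketched in Python. The version below is an illustration rather than a translation of the jq code: it keeps the cells in a dict keyed by (row, column) index instead of the [m, n, a] triple, and it nests the result per row, as in the question, rather than producing the flat listing above. The names index_of, build_grid and extract are hypothetical, and the labels are assumed to have the form "rowI" and "colJ" with I and J starting at 1.

```python
import re


def index_of(label, prefix):
    """Turn a label such as 'row2' into the zero-based index 1."""
    match = re.fullmatch(prefix + r"(\d+)", label)
    if match is None:
        raise ValueError(f"unexpected label: {label!r}")
    return int(match.group(1)) - 1


def build_grid(data):
    """Collect {'value', 'source'} items into a dense m x n grid of lists."""
    cells = {}
    for row, col, value, source in data:
        key = (index_of(row, "row"), index_of(col, "col"))
        cells.setdefault(key, []).append({"value": value, "source": source})
    m = max((r for r, _ in cells), default=-1) + 1
    n = max((c for _, c in cells), default=-1) + 1
    # Cells that never received data stay as empty lists.
    return [[cells.get((r, c), []) for c in range(n)] for r in range(m)]


def extract(grid, field):
    """Pull one field out of every cell; empty cells become None."""
    return [[[item[field] for item in cell] or None for cell in row]
            for row in grid]
```

Because the grid is dense, a row with no data at all is kept as a row of None entries instead of being dropped, and the number of columns is derived from the data rather than from "headers".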
Here is a solution that uses reduce, getpath and setpath:
.headers as $headers
| reduce .data[] as [$r,$c,$v,$s] (
    {headers:$headers, values:{}, sources:{}}
  ; setpath(["values", $r, $c]; (getpath(["values", $r, $c]) // []) + [$v])
  | setpath(["sources", $r, $c]; (getpath(["sources", $r, $c]) // []) + [$s])
  )
| .values = [ .values[] | [ .[ $headers[] ] ] ]
| .sources = [ .sources[] | [ .[ $headers[] ] ] ]
Sample output (manually reformatted for readability):
{
"headers":["col1","col2","col3"],
"values":[[["b","b","b"],["d","e","f"],["j","g","h","i"]],
[["l","j","k"],null,["o"]]],
"sources":[[["src2","src1","src3"],["src4","src5","src6"],["src7","src8","src9","src10"]],
[["src13","src11","src12"],null,["src15"]]]
}
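The accumulate-by-path idea behind this answer can be sketched in Python with nested dicts, where setdefault plays the role of the getpath/setpath pair. The function name pivot_in_order is hypothetical. The sketch assumes the input is already grouped by row, as stated in the question, and it relies on Python dicts preserving insertion order to keep the rows in their original sequence.

```python
def pivot_in_order(doc):
    """Accumulate values and sources per (row, col), keeping input order."""
    headers = doc["headers"]
    values, sources = {}, {}
    for row, col, value, source in doc["data"]:
        # Create the nested containers on first use, then append.
        values.setdefault(row, {}).setdefault(col, []).append(value)
        sources.setdefault(row, {}).setdefault(col, []).append(source)
    return {
        "headers": headers,
        # A header with no data in a given row maps to None (JSON null).
        "values": [[cols.get(h) for h in headers] for cols in values.values()],
        "sources": [[cols.get(h) for h in headers] for cols in sources.values()],
    }
```

Passing the result to json.dumps renders the None entries as null, which matches the shape of the order-preserving variant requested in the UPDATE.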