Extract row families from linked rows
I have a table of linked transactions similar to the following table:
+----+----+----+
| # | A | B |
+----+----+----+
| 1 | 1 | 4 |
| 2 | 3 | 5 |
| 3 | 4 | 6 |
| 4 | 5 | 8 |
| 5 | 6 | 1 |
| 6 | 7 | 7 |
| 7 | 8 | 3 |
| 8 | 9 | 3 |
| 9 | 10 | 4 |
| 10 | 11 | 14 |
| 11 | 2 | 2 |
| 12 | 12 | 4 |
| 13 | 13 | 14 |
| 14 | 14 | 9 |
| 15 | 15 | 1 |
+----+----+----+
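The numbers in columns A and B are transaction IDs. So, for example, transaction 1 is linked to transaction 4 by some criterion, tran 3 is linked to tran 5, tran 4 is linked to tran 6, and so on.
Transactions 2 and 7 are not linked to any other transaction, so they are linked to themselves.
I would like to extract families of transactions from this table. Because tran 1 and 4 are linked, tran 4 and 6 are linked, tran 10 and 4 are linked, and so on, they all belong to one family of transactions: (1, 4, 6, 10, 12, 15).
I would like the lowest transaction ID in each family to be the master transaction.
So ideally the output would look like this: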
+----+------+--------------+
| # | Tran | Master_tran |
+----+------+--------------+
| 1 | 1 | 1 |
| 2 | 3 | 3 |
| 3 | 4 | 1 |
| 4 | 5 | 3 |
| 5 | 6 | 1 |
| 6 | 7 | 7 |
| 7 | 8 | 3 |
| 8 | 9 | 3 |
| 9 | 10 | 1 |
| 10 | 11 | 3 |
| 11 | 2 | 2 |
| 12 | 12 | 1 |
| 13 | 13 | 3 |
| 14 | 14 | 3 |
| 15 | 15 | 1 |
+----+------+--------------+
I have been experimenting with a self-join:
SELECT t1.a as x,
least (min(t1.b), min(t2.a)) as y
FROM test t1
LEFT JOIN test t2 on t2.b = t1.a
GROUP BY t1.a
ORDER BY t1.a asc
This code gives the following output:
+------+----+---+
| Col1 | X | Y |
+------+----+---+
| 1 | 1 | 4 |
| 2 | 2 | 2 |
| 3 | 3 | 5 |
| 4 | 4 | 1 |
| 5 | 5 | 3 |
| 6 | 6 | 1 |
| 7 | 7 | 7 |
| 8 | 8 | 3 |
| 9 | 9 | 3 |
| 10 | 10 | |
| 11 | 11 | |
| 12 | 12 | |
| 13 | 13 | |
| 14 | 14 | 9 |
| 15 | 15 | |
+------+----+---+
I am not sure what is wrong with my code. Could someone point me in the right direction?
Thanks!
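In principle you need a CONNECT BY statement to solve a hierarchical problem like this.
Because your data contains cycles, you also need the NOCYCLE clause. It drops the last link of a cycle, which is fine, because that link would never be part of the answer.
You also have links pointing in both directions (e.g. (13, 14) and (14, 9)), so you have to take care to include every link in the query in both directions (twice!).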
WITH t_order
AS (SELECT qt.qt_id, qt.qt_a, qt.qt_b, LEAST( qt.qt_a, qt.qt_b ) AS t_parent, GREATEST( qt.qt_a, qt.qt_b ) AS t_child
FROM query_test qt
UNION
SELECT qb.qt_id, qb.qt_a, qb.qt_b, GREATEST( qb.qt_a, qb.qt_b ) AS t_parent, LEAST( qb.qt_a, qb.qt_b ) AS t_child
FROM query_test qb)
, hier
AS (SELECT ps.qt_id
, ps.qt_a
, ps.qt_b
, t_parent
, t_child
, LEVEL
, CONNECT_BY_ROOT t_parent AS prev_tran
FROM t_order ps
CONNECT BY NOCYCLE PRIOR t_child = t_parent)
SELECT hr.qt_id, hr.qt_a, MIN( hr.prev_tran ) AS master_tran
FROM hier hr
GROUP BY hr.qt_id, hr.qt_a
ORDER BY hr.qt_id, hr.qt_a;
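This will solve your problem, but it can become very slow if it has to process 100,000 records. The SQL statement also becomes hard to read if you need to combine this approach with many other columns. In that case you should factor out all the qt.qt_ columns and join them back in only in the final select: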
WITH t_order
AS (SELECT DISTINCT tran, root_tran
FROM (SELECT LEAST( qt.qt_a, qt.qt_b ) AS tran, GREATEST( qt.qt_a, qt.qt_b ) AS root_tran
FROM query_test qt
UNION
SELECT GREATEST( qb.qt_a, qb.qt_b ) AS tran, LEAST( qb.qt_a, qb.qt_b ) AS root_tran
FROM query_test qb))
, hier
AS (SELECT DISTINCT tran, root_tran
FROM (SELECT tran, CONNECT_BY_ROOT root_tran AS root_tran
FROM t_order
CONNECT BY NOCYCLE PRIOR tran = root_tran)
WHERE tran >= root_tran)
SELECT qt.qt_id
, qt.qt_a
, MIN( LEAST( h1.root_tran, h2.root_tran ) ) AS master_tran
FROM query_test qt
INNER JOIN hier h1 ON qt.qt_a = h1.tran
INNER JOIN hier h2 ON qt.qt_b = h2.tran
GROUP BY qt.qt_id, qt.qt_a
ORDER BY qt.qt_id, qt.qt_a;
I have not been able to test the last statement.
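Because that statement is untested, it may help to have an independent reference result to compare its output against. The following is only a minimal sketch in Python, not part of the SQL solution. It treats each row as an undirected link and uses a union-find structure (a swapped-in technique, not what CONNECT BY does internally) to give every transaction the lowest ID in its family. The names `LINKS` and `master_trans` are made up for illustration; the data is the demo table from the question.

```python
# Demo links copied from the question: (A, B) pairs of transaction IDs.
LINKS = [(1, 4), (3, 5), (4, 6), (5, 8), (6, 1), (7, 7), (8, 3), (9, 3),
         (10, 4), (11, 14), (2, 2), (12, 4), (13, 14), (14, 9), (15, 1)]


def master_trans(links):
    """Map every transaction ID to the lowest ID in its family."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps chains short
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        # Always keep the smaller ID as the root, so the root is the master.
        if root_a < root_b:
            parent[root_b] = root_a
        elif root_b < root_a:
            parent[root_a] = root_b

    return {tran: find(tran) for tran in parent}


if __name__ == "__main__":
    masters = master_trans(LINKS)
    for row, (a, _b) in enumerate(LINKS, start=1):
        print(row, a, masters[a])
```

On the demo data this reproduces the Master_tran column of the desired output in the question, so it can serve as a baseline when testing either SQL statement.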
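I may have come up with another solution.
Instead of using a CONNECT BY statement, you can double the links, and keep doubling them as often as needed.
The query that retrieves all links stays the same, but it is followed by a simple query that replaces the original links with all distinct combinations of two links.
Counting the link formed by tran_a and tran_b itself, you have 2 + 1 + 2 links, so you can find paths up to 5 links long.
If that is too short, you insert an identical subquery below the previous one, and now it is 4 + 1 + 4, which makes paths up to 9 links long.
As you can see, every added subquery doubles the maximum path length, while the performance cost goes up only a little.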
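The doubling idea can also be sketched outside the database. The Python below is an illustration under assumptions, not a translation of the SQL: `symmetric`, `compose` and `masters` are hypothetical names, and each level is composed with itself so that the path length really doubles per step, as the description above says.

```python
# Demo links copied from the question: (A, B) pairs of transaction IDs.
LINKS = [(1, 4), (3, 5), (4, 6), (5, 8), (6, 1), (7, 7), (8, 3), (9, 3),
         (10, 4), (11, 14), (2, 2), (12, 4), (13, 14), (14, 9), (15, 1)]


def symmetric(links):
    """Level 0: every link in both directions, as (root_tran, tran) pairs."""
    return {(a, b) for a, b in links} | {(b, a) for a, b in links}


def compose(left, right):
    """Join left.tran = right.root_tran and keep (left.root_tran, right.tran)."""
    by_root = {}
    for root, tran in right:
        by_root.setdefault(root, set()).add(tran)
    return {(root, end) for root, mid in left for end in by_root.get(mid, ())}


def masters(links, doublings=1):
    """Master per input row after the given number of doubling steps."""
    level = symmetric(links)
    for _ in range(doublings):
        level = compose(level, level)  # paths of exactly twice the length
    lowest = {}
    for root, tran in level:
        lowest[tran] = min(root, lowest.get(tran, root))
    # Like the final SELECT: take the lower value found via A and via B.
    return [min(lowest[a], lowest[b]) for a, b in links]


if __name__ == "__main__":
    print(masters(LINKS, doublings=0))
    print(masters(LINKS, doublings=1))
```

With `doublings=1` the demo data already gives the expected masters. With `doublings=0` (plain links only) rows 10 and 13 come out as 9 instead of 3, because transaction 3 is more than one link away from both ends of those rows.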
First, a query to check against your demo data:
WITH double_0
AS (SELECT DISTINCT root_tran, tran
FROM ( SELECT LEAST( td_0.tran_a, td_0.tran_b ) AS root_tran
, GREATEST( td_0.tran_a, td_0.tran_b ) AS tran
FROM tran_demo td_0
UNION
SELECT GREATEST( qb.tran_a, qb.tran_b ) AS root_tran
, LEAST( qb.tran_a, qb.tran_b ) AS tran
FROM tran_demo qb ))
, double_1
AS (SELECT DISTINCT oa.root_tran, ob.tran
FROM double_0 oa INNER JOIN double_0 ob ON oa.tran = ob.root_tran)
SELECT td_1.td_id
, td_1.tran_a
, MIN( LEAST( d1.root_tran, d2.root_tran ) ) AS master_tran
FROM tran_demo td_1
INNER JOIN double_1 d1 ON td_1.tran_a = d1.tran
INNER JOIN double_1 d2 ON td_1.tran_b = d2.tran
GROUP BY td_1.td_id, td_1.tran_a
ORDER BY td_1.td_id, td_1.tran_a;
Then this is how you extend it.
Note that the final query now reads from double_2.
WITH double_0
AS (SELECT DISTINCT root_tran, tran
FROM ( SELECT LEAST( td_0.tran_a, td_0.tran_b ) AS root_tran
, GREATEST( td_0.tran_a, td_0.tran_b ) AS tran
FROM tran_demo td_0
UNION
SELECT GREATEST( qb.tran_a, qb.tran_b ) AS root_tran
, LEAST( qb.tran_a, qb.tran_b ) AS tran
FROM tran_demo qb ))
, double_1
AS (SELECT DISTINCT oa.root_tran, ob.tran
FROM double_0 oa INNER JOIN double_0 ob ON oa.tran = ob.root_tran)
, double_2
AS (SELECT DISTINCT oa.root_tran, ob.tran
FROM double_1 oa INNER JOIN double_1 ob ON oa.tran = ob.root_tran)
SELECT td_1.td_id
, td_1.tran_a
, MIN( LEAST( d1.root_tran, d2.root_tran ) ) AS master_tran
FROM tran_demo td_1
INNER JOIN double_2 d1 ON td_1.tran_a = d1.tran
INNER JOIN double_2 d2 ON td_1.tran_b = d2.tran
GROUP BY td_1.td_id, td_1.tran_a
ORDER BY td_1.td_id, td_1.tran_a;
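Finally, a query to check whether the path length you are using is still sufficient.
It computes the result for the next level and subtracts the result for the current level.
As long as this query returns no rows, the current query is correct.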
WITH double_0
AS (SELECT DISTINCT root_tran, tran
FROM ( SELECT LEAST( td_0.tran_a, td_0.tran_b ) AS root_tran
, GREATEST( td_0.tran_a, td_0.tran_b ) AS tran
FROM tran_demo td_0
UNION
SELECT GREATEST( qb.tran_a, qb.tran_b ) AS root_tran
, LEAST( qb.tran_a, qb.tran_b ) AS tran
FROM tran_demo qb ))
, double_1
AS (SELECT DISTINCT oa.root_tran, ob.tran
FROM double_0 oa INNER JOIN double_0 ob ON oa.tran = ob.root_tran)
, double_2
AS (SELECT DISTINCT oa.root_tran, ob.tran
FROM double_1 oa INNER JOIN double_1 ob ON oa.tran = ob.root_tran)
SELECT td_1.tran_a
, MIN( LEAST( d1.root_tran, d2.root_tran ) ) AS master_tran
FROM tran_demo td_1
INNER JOIN double_2 d1 ON td_1.tran_a = d1.tran
INNER JOIN double_2 d2 ON td_1.tran_b = d2.tran
GROUP BY td_1.tran_a
MINUS
SELECT td_2.tran_a
, MIN( LEAST( d1.root_tran, d2.root_tran ) ) AS master_tran
FROM tran_demo td_2
INNER JOIN double_1 d1 ON td_2.tran_a = d1.tran
INNER JOIN double_1 d2 ON td_2.tran_b = d2.tran
GROUP BY td_2.tran_a
ORDER BY tran_a;
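The check above can be mimicked in a few lines of Python. This is a sketch under assumptions: `results_at` and `stale_rows` are invented names, rows are identified by their (A, B) pair rather than by an ID column, and each level follows paths of exactly 2**doublings links, which is what composing a level with itself gives.

```python
# Demo links copied from the question: (A, B) pairs of transaction IDs.
LINKS = [(1, 4), (3, 5), (4, 6), (5, 8), (6, 1), (7, 7), (8, 3), (9, 3),
         (10, 4), (11, 14), (2, 2), (12, 4), (13, 14), (14, 9), (15, 1)]


def results_at(links, doublings):
    """Master per row when paths of exactly 2**doublings links are followed."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    reach = neighbours  # paths of exactly one link
    for _ in range(doublings):
        # Compose the relation with itself: the path length doubles.
        reach = {node: set().union(*(reach[mid] for mid in ends))
                 for node, ends in reach.items()}
    return {(a, b): min(reach[a] | reach[b]) for a, b in links}


def stale_rows(links, doublings):
    """Rows whose master still changes at the next level (the MINUS idea)."""
    current = results_at(links, doublings)
    deeper = results_at(links, doublings + 1)
    return {row: deeper[row] for row in links if deeper[row] != current[row]}


if __name__ == "__main__":
    print(stale_rows(LINKS, 0))  # plain links only: two rows still change
    print(stale_rows(LINKS, 1))  # one doubling: nothing changes any more
```

For the demo data one doubling leaves no stale rows, which matches the claim that the first query is deep enough for it. An empty result is taken as the stopping signal here only because the answer reasons that way; it is not proven in this sketch.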
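You will have to do the performance testing yourself.
I am optimistic, because the subqueries are cheap and the effective path length doubles each time.
Sooner or later this will become faster than the previous solution.
By the way, the remark about ordering the original links applies here as well!
Please mark my answer if it works.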