Extract row families from linked rows

I have a table of linked transactions similar to the following:

+----+----+----+
| #  | A  | B  |
+----+----+----+
| 1  | 1  | 4  |
| 2  | 3  | 5  |
| 3  | 4  | 6  |
| 4  | 5  | 8  |
| 5  | 6  | 1  |
| 6  | 7  | 7  |
| 7  | 8  | 3  |
| 8  | 9  | 3  |
| 9  | 10 | 4  |
| 10 | 11 | 14 |
| 11 | 2  | 2  |
| 12 | 12 | 4  |
| 13 | 13 | 14 |
| 14 | 14 | 9  |
| 15 | 15 | 1  |
+----+----+----+

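The numbers in columns A and B are transaction IDs. So, for example, transaction 1 is linked to transaction 4 by some criterion, tran 3 is linked to tran 5, tran 4 is linked to tran 6, and so on.

Transactions 2 and 7 are not linked to any other transaction, so they are linked to themselves.

I want to extract transaction families from this table. Because tran 1 and 4 are linked, tran 4 and 6 are linked, tran 10 and 4 are linked, and so on, they all belong to one family: (1, 4, 6, 10, 12, 15).

Within each family, the transaction with the lowest ID should become the master transaction. So ideally, the output would look like this: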

+----+------+-------------+
| #  | Tran | Master_tran |
+----+------+-------------+
| 1  | 1    | 1           |
| 2  | 3    | 3           |
| 3  | 4    | 1           |
| 4  | 5    | 3           |
| 5  | 6    | 1           |
| 6  | 7    | 7           |
| 7  | 8    | 3           |
| 8  | 9    | 3           |
| 9  | 10   | 1           |
| 10 | 11   | 3           |
| 11 | 2    | 2           |
| 12 | 12   | 1           |
| 13 | 13   | 3           |
| 14 | 14   | 3           |
| 15 | 15   | 1           |
+----+------+-------------+

I have been experimenting with a self-join:

SELECT     t1.a as x, 
           least (min(t1.b), min(t2.a)) as y  
FROM       test   t1 
LEFT JOIN  test   t2 on t2.b = t1.a  
GROUP BY   t1.a 
ORDER BY   t1.a asc

This code gives the following output:

+------+----+---+
| Col1 | X  | Y |
+------+----+---+
|    1 |  1 | 4 |
|    2 |  2 | 2 |
|    3 |  3 | 5 |
|    4 |  4 | 1 |
|    5 |  5 | 3 |
|    6 |  6 | 1 |
|    7 |  7 | 7 |
|    8 |  8 | 3 |
|    9 |  9 | 3 |
|   10 | 10 |   |
|   11 | 11 |   |
|   12 | 12 |   |
|   13 | 13 |   |
|   14 | 14 | 9 |
|   15 | 15 |   |
+------+----+---+

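I'm not sure what is wrong with my code. Can someone point me in the right direction? Thanks!

In principle you need a CONNECT BY statement to solve a hierarchical problem like this. Because your data contains loops, you also need the NOCYCLE clause, which drops the last link of a loop. That is fine, because that link never contributes to the answer. You also have links in both directions (e.g. (13, 14) and (14, 9)), so you have to be careful to include every link in the query in both orientations (twice!).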

WITH t_order
     AS (SELECT qt.qt_id, qt.qt_a, qt.qt_b, LEAST( qt.qt_a, qt.qt_b ) AS t_parent, GREATEST( qt.qt_a, qt.qt_b ) AS t_child
       FROM query_test qt
     UNION
     SELECT qb.qt_id, qb.qt_a, qb.qt_b, GREATEST( qb.qt_a, qb.qt_b ) AS t_parent, LEAST( qb.qt_a, qb.qt_b ) AS t_child
       FROM query_test qb)
, hier
  AS (SELECT     ps.qt_id
              , ps.qt_a
              , ps.qt_b
              , t_parent
              , t_child
              , LEVEL
              , CONNECT_BY_ROOT t_parent AS prev_tran
           FROM t_order ps
     CONNECT BY NOCYCLE PRIOR t_child = t_parent)
SELECT   hr.qt_id, hr.qt_a, MIN( hr.prev_tran ) AS master_tran
  FROM hier hr
GROUP BY hr.qt_id, hr.qt_a
ORDER BY hr.qt_id, hr.qt_a;

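This will solve your problem, but it can become very slow if it has to process 100,000 records. The SQL statement also becomes hard to follow if you need to combine this approach with many other columns. For that reason you should factor out all the qt.qt columns and join them back in only in the final select:
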
WITH t_order
     AS (SELECT DISTINCT tran, root_tran
           FROM (SELECT LEAST( qt.qt_a, qt.qt_b ) AS tran, GREATEST( qt.qt_a, qt.qt_b ) AS root_tran
                   FROM query_test qt
                 UNION
                 SELECT GREATEST( qb.qt_a, qb.qt_b ) AS tran, LEAST( qb.qt_a, qb.qt_b ) AS root_tran
                   FROM query_test qb))
   , hier
     AS (SELECT DISTINCT tran, root_tran
           FROM (SELECT     tran, CONNECT_BY_ROOT root_tran AS root_tran
                       FROM t_order
                 CONNECT BY NOCYCLE PRIOR tran = root_tran)
          WHERE tran >= root_tran)
SELECT   qt.qt_id
       , qt.qt_a
       , MIN( LEAST( h1.root_tran, h2.root_tran ) ) AS master_tran
    FROM query_test qt
         INNER JOIN hier h1 ON qt.qt_a = h1.tran
         INNER JOIN hier h2 ON qt.qt_b = h2.tran
GROUP BY qt.qt_id, qt.qt_a
ORDER BY qt.qt_id, qt.qt_a;

I was not able to test this last statement.
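
Since that statement is untested, it may help to have an independent reference for the expected master_tran values. The following is a minimal Python sketch and not part of the SQL solution: it hard-codes the A/B pairs from the question and groups them with a union-find structure, a technique swapped in here purely for cross-checking. The names LINKS and master_transactions are made up for illustration.

```python
# Sample (A, B) links copied from the question's table, in row order.
LINKS = [(1, 4), (3, 5), (4, 6), (5, 8), (6, 1), (7, 7), (8, 3), (9, 3),
         (10, 4), (11, 14), (2, 2), (12, 4), (13, 14), (14, 9), (15, 1)]


def master_transactions(links):
    """Map every transaction ID to the lowest ID in its linked family."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        # Keep the smaller ID as the root, so each root is its family's master.
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return {tran: find(tran) for tran in parent}


masters = master_transactions(LINKS)
for a, _ in LINKS:
    print(a, masters[a])  # one "Tran Master_tran" pair per input row
```

Comparing this dictionary with the rows returned by either query is a quick way to spot a family that was not fully resolved.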

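I may have come up with another solution.
Instead of using a CONNECT BY statement, you can double the links, and do so as often as you need. The query that retrieves all links stays the same, but it is followed by a simple query that replaces the original links with all distinct combinations of two links.
Including the link formed by tran_a and tran_b itself, you have 2 + 1 + 2 links, so you can find paths up to 5 links long. If that is too short, you insert an identical subquery below the previous one, and now it is 4 + 1 + 4, which makes paths up to 9 links long. As you can see, each added subquery doubles the maximum path length, while the performance cost grows only a little.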
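
To make the doubling concrete, here is a small Python sketch of the same idea outside the database. It assumes that every new level joins the previous level with itself, as the paragraph above describes. The chain 1-2-...-10 is hypothetical data chosen only so that path lengths are easy to read off, and both_directions, join and links_covered are illustrative names.

```python
def both_directions(links):
    """Every link in both orientations, like the double_0 subquery."""
    return {(a, b) for a, b in links} | {(b, a) for a, b in links}


def join(left, right):
    """SELECT DISTINCT oa.root_tran, ob.tran ... ON oa.tran = ob.root_tran."""
    by_root = {}
    for root, tran in right:
        by_root.setdefault(root, set()).add(tran)
    return {(root, tran) for root, mid in left for tran in by_root.get(mid, ())}


# Hypothetical chain 1-2-3-...-10 (not the question's data).
chain = [(n, n + 1) for n in range(1, 10)]

double_0 = both_directions(chain)    # paths of exactly 1 link
double_1 = join(double_0, double_0)  # exactly 2 links
double_2 = join(double_1, double_1)  # exactly 4 links
double_3 = join(double_2, double_2)  # exactly 8 links


def links_covered(rel, start=1):
    """Distance from `start` to the farthest transaction paired with it."""
    return max(tran - start for root, tran in rel if root == start)


print(*(links_covered(d) for d in (double_0, double_1, double_2, double_3)))
# 1 2 4 8
```

Each level holds only the pairs connected by a path of exactly that many links, which is why the final query looks at both tran_a and tran_b to pick up the neighbouring lengths as well.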

First, a query to check against your demo data:

WITH double_0
     AS (SELECT DISTINCT root_tran, tran
           FROM ( SELECT LEAST( td_0.tran_a, td_0.tran_b ) AS root_tran
                       , GREATEST( td_0.tran_a, td_0.tran_b ) AS tran
                    FROM tran_demo td_0
                  UNION
                  SELECT GREATEST( qb.tran_a, qb.tran_b ) AS root_tran
                       , LEAST( qb.tran_a, qb.tran_b ) AS tran
                    FROM tran_demo qb ))
   , double_1
     AS (SELECT DISTINCT oa.root_tran, ob.tran
           FROM double_0 oa INNER JOIN double_0 ob ON oa.tran = ob.root_tran)
SELECT   td_1.td_id
       , td_1.tran_a
       , MIN( LEAST( d1.root_tran, d2.root_tran ) ) AS master_tran
    FROM tran_demo td_1
         INNER JOIN double_1 d1 ON td_1.tran_a = d1.tran
         INNER JOIN double_1 d2 ON td_1.tran_b = d2.tran
GROUP BY td_1.td_id, td_1.tran_a
ORDER BY td_1.td_id, td_1.tran_a;

Then, this is how you modify it.
Note that the final query now selects from double_2:

WITH double_0
     AS (SELECT DISTINCT root_tran, tran
           FROM ( SELECT LEAST( td_0.tran_a, td_0.tran_b ) AS root_tran
                       , GREATEST( td_0.tran_a, td_0.tran_b ) AS tran
                    FROM tran_demo td_0
                  UNION
                  SELECT GREATEST( qb.tran_a, qb.tran_b ) AS root_tran
                       , LEAST( qb.tran_a, qb.tran_b ) AS tran
                    FROM tran_demo qb ))
   , double_1
     AS (SELECT DISTINCT oa.root_tran, ob.tran
           FROM double_0 oa INNER JOIN double_0 ob ON oa.tran = ob.root_tran)
   , double_2
     AS (SELECT DISTINCT oa.root_tran, ob.tran
           -- join double_1 to itself so the covered path length doubles again
           FROM double_1 oa INNER JOIN double_1 ob ON oa.tran = ob.root_tran)
SELECT   td_1.td_id
       , td_1.tran_a
       , MIN( LEAST( d1.root_tran, d2.root_tran ) ) AS master_tran
    FROM tran_demo td_1
         INNER JOIN double_2 d1 ON td_1.tran_a = d1.tran
         INNER JOIN double_2 d2 ON td_1.tran_b = d2.tran
GROUP BY td_1.td_id, td_1.tran_a
ORDER BY td_1.td_id, td_1.tran_a;

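A last query checks whether the path length you are using is still sufficient: it computes the result for the next level and subtracts (MINUS) the result for the current level.
As long as this query does not return any rows, the current query is correct.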

WITH double_0
     AS (SELECT DISTINCT root_tran, tran
           FROM ( SELECT LEAST( td_0.tran_a, td_0.tran_b ) AS root_tran
                       , GREATEST( td_0.tran_a, td_0.tran_b ) AS tran
                    FROM tran_demo td_0
                  UNION
                  SELECT GREATEST( qb.tran_a, qb.tran_b ) AS root_tran
                       , LEAST( qb.tran_a, qb.tran_b ) AS tran
                    FROM tran_demo qb ))
   , double_1
     AS (SELECT DISTINCT oa.root_tran, ob.tran
           FROM double_0 oa INNER JOIN double_0 ob ON oa.tran = ob.root_tran)
   , double_2
     AS (SELECT DISTINCT oa.root_tran, ob.tran
           -- join double_1 to itself so the covered path length doubles again
           FROM double_1 oa INNER JOIN double_1 ob ON oa.tran = ob.root_tran)
SELECT   td_1.tran_a
       , MIN( LEAST( d1.root_tran, d2.root_tran ) ) AS master_tran
    FROM tran_demo td_1
         INNER JOIN double_2 d1 ON td_1.tran_a = d1.tran
         INNER JOIN double_2 d2 ON td_1.tran_b = d2.tran
GROUP BY td_1.tran_a
MINUS
SELECT   td_2.tran_a
       , MIN( LEAST( d1.root_tran, d2.root_tran ) ) AS master_tran
    FROM tran_demo td_2
         INNER JOIN double_1 d1 ON td_2.tran_a = d1.tran
         INNER JOIN double_1 d2 ON td_2.tran_b = d2.tran
GROUP BY td_2.tran_a
ORDER BY tran_a;
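
The same stopping rule can be sketched in Python as a loop: keep doubling until one more level no longer changes any master_tran. This is only an illustration of the check under the assumption that each level joins the previous one with itself. It reuses the question's A/B pairs, and stable_masters and max_levels are hypothetical names.

```python
# Sample (A, B) links copied from the question's table.
LINKS = [(1, 4), (3, 5), (4, 6), (5, 8), (6, 1), (7, 7), (8, 3), (9, 3),
         (10, 4), (11, 14), (2, 2), (12, 4), (13, 14), (14, 9), (15, 1)]


def stable_masters(links, max_levels=8):
    """Return (level, masters) for the first level that the next one confirms."""
    # double_0: every link in both orientations.
    rel = {(a, b) for a, b in links} | {(b, a) for a, b in links}
    previous = None
    for level in range(1, max_levels + 1):
        by_root = {}
        for root, tran in rel:
            by_root.setdefault(root, set()).add(tran)
        # double_<level>: join the previous level with itself.
        rel = {(root, tran) for root, mid in rel for tran in by_root[mid]}
        lowest = {}
        for root, tran in rel:
            lowest[tran] = min(root, lowest.get(tran, root))
        # Final query: lowest root seen from either end of each row.
        current = {a: min(lowest[a], lowest[b]) for a, b in links}
        if current == previous:  # the MINUS query would return no rows
            return level - 1, current
        previous = current
    raise RuntimeError("no stable level found within max_levels")


level, masters = stable_masters(LINKS)
print(level, masters)
```

For the demo data the loop stops after comparing the first two levels, which matches the statement above that the first query is already sufficient there.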

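You will have to do the performance testing yourself. I am optimistic, because the subqueries are cheap and each one doubles the effective path length. Sooner or later this becomes faster than the previous solution.
By the way, the remark about ordering the original links applies here as well!
Please mark my answer if it works.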