Replacing iterative INSERT from SELECT with recursion or functions to traverse paths in Postgres

In my schema, a dataset has many cfiles and each cfile belongs to one dataset. Each cfile also has some dynamic property values stored in jsonb.

SELECT * FROM datasets;
 id |   name   
----+----------
  1 | Dataset1
  2 | Dataset2


SELECT * FROM cfiles WHERE dataset_id=1;
 id | dataset_id |            path             |                     property_values                      
----+------------+-----------------------------+----------------------------------------------------------
  1 |          1 | dir_i/file1.txt             | {"Project": "ProjW", "Sample Names": ["sampA", "sampB"]}
  2 |          1 | dir_i/dir_j/file2.txt       | {"Project": "ProjX", "Sample Names": ["sampA", "sampC"]}
  3 |          1 | dir_i/dir_j/dir_k/file3.txt | {"Project": "ProjY", "Sample Names": ["sampD"]}
  4 |          1 | dir_m/file4.txt             | {"Project": "ProjZ", "Sample Names": ["sampE"]}

Based on an SO question and its excellent answer, I have the following query:

INSERT into agg_prop_vals(dataset_id, path, sample_names, projects)
  SELECT DISTINCT
  cfiles.dataset_id,
  '.' as path,
  -- ** path specific:
  -- 'dir_i/dir_j/' as path,
  h."Sample Names", h."Project"
  FROM (
    SELECT
    dataset_id,
    string_agg(DISTINCT "Sample Names", '; ' ORDER BY "Sample Names") as "Sample Names",
    string_agg(DISTINCT "Project", '; ' ORDER BY "Project") as "Project"
    FROM (
      SELECT
      cfiles.dataset_id as dataset_id,
      property_values ->> 'Project' as "Project",
      jsonb_array_elements_text(property_values -> 'Sample Names') as "Sample Names"
      FROM cfiles
      WHERE cfiles.dataset_id=1
      -- ** path specific:
      -- AND cfiles.path LIKE 'dir_i/dir_j/%'
    ) g GROUP BY dataset_id
  ) h
  JOIN cfiles ON (cfiles.dataset_id=h.dataset_id)
  WHERE cfiles.dataset_id=1
  ON CONFLICT (dataset_id, path)
  DO UPDATE SET
    sample_names = excluded.sample_names,
    projects = excluded.projects

This produces a table of aggregated cfile property values for a specific dataset:

SELECT * FROM agg_prop_vals;
 dataset_id |        path        |           sample_names            |          projects          
------------+--------------------+-----------------------------------+----------------------------
          1 | .                  | sampA; sampB; sampC; sampD; sampE | ProjW; ProjX; ProjY; ProjZ

This works great for getting the aggregated values per dataset, but I now also want the aggregated values per dataset + path, like this:

SELECT * FROM agg_prop_vals;
 dataset_id |        path        |           sample_names            |          projects          
------------+--------------------+-----------------------------------+----------------------------
          1 | .                  | sampA; sampB; sampC; sampD; sampE | ProjW; ProjX; ProjY; ProjZ
          1 | dir_i/             | sampA; sampB; sampC; sampD        | ProjW; ProjX; ProjY
          1 | dir_i/dir_j/       | sampA; sampC; sampD               | ProjX; ProjY
          1 | dir_i/dir_j/dir_k/ | sampD                             | ProjY
          1 | dir_m/             | sampE                             | ProjZ

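All processing is done one dataset at a time, so I'm happy to iterate over datasets (so WHERE cfiles.dataset_id=1 can be ignored/treated as a constant for this example). The problem I'm having is iterating over the paths.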

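I can run the same query above for every path in the dataset (e.g. by uncommenting the ** path specific: lines), but when there are thousands of sub-paths in a single dataset this can take an hour. For example: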

("SELECT DISTINCT SUBSTRING(path, '(.*\/).*') FROM cfiles WHERE dataset_id=1").each do |sub_path|
  aggregate_query(sub_path)
end

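But this is also inefficient: instead of reusing the already-computed sub-directory aggregates at each level, it runs the query over all descendant cfiles again at every level.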

I.e. to compute:

 dataset_id |        path        |           sample_names            |          projects          
------------+--------------------+-----------------------------------+----------------------------
          1 | .                  | sampA; sampB; sampC; sampD; sampE | ProjW; ProjX; ProjY; ProjZ

it should combine the precomputed aggregates of the top-level sub-directories:

 dataset_id |        path        |           sample_names            |          projects          
------------+--------------------+-----------------------------------+----------------------------
          1 | dir_i/             | sampA; sampB; sampC; sampD        | ProjW; ProjX; ProjY

plus:

 dataset_id |        path        |           sample_names            |          projects          
------------+--------------------+-----------------------------------+----------------------------
          1 | dir_m/             | sampE                             | ProjZ

rather than traversing all the descendant cfiles again.

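Is there any way to replace the iteration with some kind of query or PL/SQL that uses recursion or functions to traverse the directory paths and populate the agg_prop_vals table accordingly?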

Other points:

I'll load the output of "find /lib /bin /etc", about 27k lines, into a table...

BEGIN;
CREATE TABLE _files( path TEXT NOT NULL );
\copy _files (path) from 'files.txt';
CREATE TABLE files( 
  id SERIAL PRIMARY KEY, 
  path TEXT NOT NULL,
  dataset_id INTEGER NOT NULL,
  attrib TEXT[] NOT NULL
 );
INSERT INTO files (path,dataset_id,attrib) SELECT path,n,ARRAY[RIGHT(path,1),RIGHT(path,2)]
 FROM _files CROSS JOIN (SELECT generate_series(1,10) n) n;
COMMIT;
VACUUM ANALYZE files;
CREATE INDEX files_dataset ON files(dataset_id);

I added generate_series to multiply the number of files by 10.

The "attrib" column contains two text values, which will stand in for your "samples".

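I'm assuming there are no double slashes in the paths and that no path ends with a slash. If that's not the case, you'll have to put this in the appropriate places in the queries: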

regexp_replace( regexp_replace( path, '(//+)', '/', 'g' ), '/$', '')
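
To sanity-check the normalization outside the database, here is a minimal Python sketch of the same two substitutions. The function name normalize_path is made up for illustration. Python's re.sub replaces every match by default, which corresponds to the 'g' flag on the inner regexp_replace.

```python
import re

def normalize_path(path: str) -> str:
    """Collapse runs of slashes, then drop one trailing slash."""
    # Mirrors regexp_replace(path, '(//+)', '/', 'g')
    collapsed = re.sub(r"//+", "/", path)
    # Mirrors regexp_replace(..., '/$', '')
    return re.sub(r"/$", "", collapsed)
```

Note that under this rule the bare root "/" normalizes to an empty string.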

Then let's add a parent_path column. Postgres regexps are slow, so this takes a while.

CREATE TEMPORARY TABLE fp AS
SELECT *, regexp_replace( path, '/[^/]+$', '' ) AS parent_path 
FROM files WHERE dataset_id=1;

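Side note: to model paths/trees in SQL you can use a parent_id, or just stick the path in a column, but in that case an array is better than a string because it's easy to access the elements.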

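I added an "attrib" column of type TEXT[] to the files table, which simulates the result of your query above that aggregates sample_names and projects. It's an array because it has to be split up again later.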

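So. Now we have to build a directory tree, including the directories that have no files in them. Those are not among the parent_path values generated above, which means they have to be generated with a recursive query. Because SQL is SQL, it doesn't start from the root; it starts from the full paths and works backwards.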

CREATE TEMPORARY TABLE dirs (
      path TEXT UNIQUE NOT NULL,
      parent_path TEXT NOT NULL,
      attrib1 TEXT[] NULL,
      attrib2 TEXT[] NULL );

INSERT INTO dirs (path, parent_path)
WITH RECURSIVE pdirs AS (SELECT * FROM
  (SELECT parent_path AS path,
          regexp_replace( parent_path, '/[^/]+$', '' ) AS parent_path FROM fp
  ) x1
 UNION  SELECT * FROM
  (SELECT parent_path AS path,
          regexp_replace( parent_path, '/[^/]+$', '' ) AS parent_path FROM pdirs
  ) x2 WHERE path != '' OR parent_path != ''
 ) SELECT * FROM pdirs ORDER BY path;
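
As a rough illustration of what the recursive query computes, here is a small Python sketch that walks upwards from each file's parent directory and records every distinct (path, parent_path) pair. The names parent_of and directory_tree are hypothetical, and the sketch assumes absolute, already-normalized paths. Unlike the SQL, it does not emit a row for the empty root path.

```python
import re

def parent_of(path: str) -> str:
    """Strip the last '/component', like regexp_replace(path, '/[^/]+$', '')."""
    return re.sub(r"/[^/]+$", "", path)

def directory_tree(file_paths):
    """Return {directory: parent_directory} for every ancestor of the given files."""
    dirs = {}
    for file_path in file_paths:
        path = parent_of(file_path)
        # Walk upwards until the root ('') or an already-seen directory is reached.
        while path != "" and path not in dirs:
            dirs[path] = parent_of(path)
            path = dirs[path]
    return dirs
```

Stopping at an already-seen directory plays the role of UNION (rather than UNION ALL) in the query: each directory is produced once, however many files sit below it.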

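Now, for each directory, take the attribute arrays of the rows in table fp directly under it, explode them, remove the duplicates, and reassemble the result into a single array. There are two ways to do this... The first one is faster, but it needs an index on the temp table. So, let's use the second.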

SELECT dirs.path, (SELECT array_agg(a) FROM (SELECT DISTINCT unnest(attrib) a FROM fp WHERE fp.parent_path=dirs.path) x) FROM dirs;

SELECT parent_path, array_agg(DISTINCT att) FROM (SELECT parent_path, unnest(attrib) att FROM fp) x GROUP BY parent_path;
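
The explode / de-duplicate / reassemble step of the second (GROUP BY) query can be pictured with this small Python sketch. The function name direct_attribs and the (parent_path, attrib) tuple layout are assumptions made for illustration. The output is sorted only to make it deterministic; array_agg(DISTINCT ...) does not promise a particular order.

```python
from collections import defaultdict

def direct_attribs(rows):
    """rows: iterable of (parent_path, attrib_list) pairs.

    Returns {parent_path: sorted list of the distinct attribute values}.
    """
    merged = defaultdict(set)
    for parent_path, attrib in rows:
        # unnest(attrib) plus DISTINCT: a set drops the duplicates.
        merged[parent_path].update(attrib)
    # array_agg: turn each set back into a list.
    return {path: sorted(values) for path, values in merged.items()}
```

Like the GROUP BY query, this only returns directories that directly contain at least one file.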

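Now, "just" do the same thing recursively on table dirs to propagate the attributes up the paths... twice, once for the samples and once for the projects, because a recursive CTE can't be referenced more than once in a query...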

WITH RECURSIVE rdirs AS (
  SELECT dirs.*, attrib FROM
  (SELECT parent_path, array_agg(DISTINCT att) attrib FROM (SELECT parent_path, unnest(attrib) att FROM fp) x GROUP BY parent_path) AS x
  JOIN dirs ON (dirs.path=x.parent_path)
UNION ALL
  SELECT dirs.*, attrib FROM
  (SELECT parent_path, array_agg(DISTINCT att) attrib FROM (SELECT parent_path, unnest(attrib) att FROM rdirs) x GROUP BY parent_path) AS x
  JOIN dirs ON (dirs.path=x.parent_path)
  WHERE dirs.path != '' OR dirs.parent_path != ''
)
UPDATE dirs 
SET attrib1=rdirs.attrib
FROM rdirs
WHERE dirs.path=rdirs.path;

So, you do this again for the projects column (changing the column names accordingly), and the temp table dirs should contain the desired result!
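
To check the propagated result on a small example, the same idea can be sketched in Python. This is not the recursive CTE above but a depth-ordered merge, which is a different technique with the same intended outcome: each directory ends up with the distinct values of every file at or below it. The function name propagate and the two dictionary arguments are hypothetical.

```python
def propagate(dirs, direct):
    """dirs: {path: parent_path}; direct: {path: values of files directly in path}.

    Returns {path: sorted distinct values of all files at or below path}.
    """
    totals = {path: set(direct.get(path, ())) for path in dirs}
    # Deepest directories first, so a directory is complete before it is
    # merged into its parent.
    for path in sorted(dirs, key=lambda p: p.count("/"), reverse=True):
        parent = dirs[path]
        if parent in totals and parent != path:
            totals[parent] |= totals[path]
    return {path: sorted(values) for path, values in totals.items()}
```

Each directory's set is merged into its parent exactly once, so the sub-directory aggregates are reused instead of rescanning all descendant files. If dirs contains an entry for the empty root path, that entry receives the dataset-wide aggregate.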

If you like a challenge, it's very likely possible to do all of this in a single query, without temp tables!