How to use PL/pgSQL to transform an original table into two desired tables in PostgreSQL?

Suppose we have this table things_happened, whose schema looks like this:

CREATE TABLE things_happened
(
  zipcode character varying(10),
  city character varying(50),
  state character varying(2),
  metro character varying(50),
  countyname character varying(50),
  "1996-04" integer,
  "1996-05" integer,
  "1996-06" integer,
  "1996-07" integer,
  "1996-08" integer,
  "1996-09" integer,
  ...
  "2014-09" integer,
  "2014-10" integer,
  "2014-11" integer
)

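It looks odd because the data was imported from a CSV file by someone else.

This table is clearly inefficient: for any given area, the values for many months are NULL. So I want to create two tables from it.

The schemas of the two desired tables are: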

area_info (zipcode, city, state, metro, countyname) with zipcode as primary key
things_happened_per_month (year, month, zipcode, times) with year, month, zipcode as primary key

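Because the table is large, new data keeps arriving, and the column names have to become parameters, I would like to know how to do this with "PL/pgSQL - SQL Procedural Language". Or is there any other efficient solution?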

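Your table things_happened looks like a pivot table, and you want to normalize it into a more efficient data structure. You will have to write a PL/pgSQL function to do this.

Since you have many months, and more columns may be added for later months, I suggest you determine the month columns in the table dynamically and then loop over the result. In the example below I assume you have already copied the area_info data into its own table; here I focus on the "times" column in the thpm table (which I assume you have already created).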
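
The answer assumes both target tables already exist and that area_info is already filled. A minimal sketch of that setup could look like the following. The column types for the area columns are copied from things_happened; the integer types for year, month and times, and the use of DISTINCT ON to keep one row per zipcode, are assumptions rather than something stated in the question.

```sql
-- Area attributes, one row per zipcode (types copied from things_happened)
CREATE TABLE area_info
(
  zipcode character varying(10) PRIMARY KEY,
  city character varying(50),
  state character varying(2),
  metro character varying(50),
  countyname character varying(50)
);

-- One row per (year, month, zipcode); integer types are an assumption
CREATE TABLE things_happened_per_month
(
  year integer,
  month integer,
  zipcode character varying(10),
  times integer,
  PRIMARY KEY (year, month, zipcode)
);

-- Copy the area data, keeping a single row for each zipcode
INSERT INTO area_info (zipcode, city, state, metro, countyname)
SELECT DISTINCT ON (zipcode) zipcode, city, state, metro, countyname
FROM things_happened
WHERE zipcode IS NOT NULL
ORDER BY zipcode;
```

If the source table holds several rows for the same zipcode with different city or county values, DISTINCT ON keeps an arbitrary one of them, so that case would need a closer look.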

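Dynamic solution

The function below dynamically looks up the YYYY-MM columns in the table, then loops over the records and the columns to put the data into the normalized table. (Many thanks to Pavel Stehule for pointing out the last nit-picky error in the code.)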

CREATE FUNCTION normalize_things_happened() RETURNS void AS $$
DECLARE
  col_names text[];
  period    text;
  th        things_happened%rowtype;
  times     integer;
BEGIN
  -- Get the currently present columns from the catalog
  SELECT array_agg(attname::text) INTO col_names
  FROM pg_attribute att
  JOIN pg_class c ON c.oid = att.attrelid
  JOIN pg_namespace n ON n.oid = c.relnamespace
  WHERE c.relname = 'things_happened'
    AND n.nspname = 'public'
    AND position('-' in attname) = 5; -- only "times" columns

  -- Loop over all the rows in the things_happened table
  FOR th IN SELECT * FROM things_happened LOOP
    -- Now loop over column names
    FOREACH period IN ARRAY col_names LOOP
      -- Fudge the proper column from the th record into a local variable:
      -- $1 is bound to th, and ($1)."YYYY-MM" selects that column's value
      EXECUTE 'SELECT ($1).' || quote_ident(period) INTO times USING th;

      -- If times is a proper value, insert it into the thpm table
      IF times IS NOT NULL THEN
        INSERT INTO things_happened_per_month (year, month, zipcode, times) VALUES
          (substring(period from 1 for 4)::int, substring(period from 6 for 2)::int, th.zipcode, times);
      END IF;
    END LOOP;
  END LOOP;
END; $$ LANGUAGE plpgsql;
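
A short usage sketch, assuming the function above and the things_happened_per_month table have been created. The per-year summary query is only an illustrative sanity check, not part of the conversion itself.

```sql
-- Run the conversion once
SELECT normalize_things_happened();

-- Spot-check the result: normalized rows and total "times" per year
SELECT year, count(*) AS row_count, sum(times) AS total_times
FROM things_happened_per_month
GROUP BY year
ORDER BY year;
```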

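This should work as a one-off exercise. If the original table keeps receiving new data, you should run this function periodically and do an UPSERT after the innermost EXECUTE instead of a plain INSERT: first try to UPDATE the "times" value and, if that updates no row because there is no data yet for the year, month, zipcode combination, do the INSERT. See the many other questions here for examples.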
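
The UPDATE-then-INSERT pattern described above could be sketched as a small helper function. The function name upsert_things_happened_per_month and its p_ parameters are hypothetical, chosen only for this illustration.

```sql
-- Hypothetical helper: update the row for (year, month, zipcode) or create it
CREATE FUNCTION upsert_things_happened_per_month(
  p_year    integer,
  p_month   integer,
  p_zipcode character varying,
  p_times   integer
) RETURNS void AS $$
BEGIN
  -- First try to update an existing row for this combination
  UPDATE things_happened_per_month
     SET times = p_times
   WHERE year = p_year AND month = p_month AND zipcode = p_zipcode;

  -- FOUND is false when the UPDATE touched no row, so insert instead
  IF NOT FOUND THEN
    INSERT INTO things_happened_per_month (year, month, zipcode, times)
    VALUES (p_year, p_month, p_zipcode, p_times);
  END IF;
END; $$ LANGUAGE plpgsql;
```

Inside the dynamic function, the INSERT statement would then be replaced by PERFORM upsert_things_happened_per_month(substring(period from 1 for 4)::int, substring(period from 6 for 2)::int, th.zipcode, times);. On PostgreSQL 9.5 or later, INSERT ... ON CONFLICT (year, month, zipcode) DO UPDATE SET times = EXCLUDED.times achieves the same in a single statement and is safer when several sessions write at once.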

Static solution

The following function is the non-dynamic variant. You have to include a separate block of code for each month column in the things_happened table.

CREATE FUNCTION normalize_things_happened() RETURNS void AS $$
DECLARE
  th        things_happened%rowtype;
  times     integer;
BEGIN
  -- Loop over all the rows in the things_happened table
  FOR th IN SELECT * FROM things_happened LOOP
    -- Copy the below block for 1996, April, for all other months.
    SELECT th."1996-04" INTO times;
    IF times IS NOT NULL THEN
      INSERT INTO things_happened_per_month (year, month, zipcode, times) VALUES (1996, 4, th.zipcode, times);
    END IF;
    -- 1996, May
    SELECT th."1996-05" INTO times;
    IF times IS NOT NULL THEN
      INSERT INTO things_happened_per_month (year, month, zipcode, times) VALUES (1996, 5, th.zipcode, times);
    END IF;
    -- Etc.
  END LOOP;
END; $$ LANGUAGE plpgsql;

Ugly, but functional.
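
Since the question also asks about other efficient approaches, here is one possible set-based sketch that avoids the row-by-row loop. It is not part of either function above. It assumes PostgreSQL 9.5 or later (for to_jsonb) and relies on each row being converted to a JSON object whose keys are the column names; the regular expression used to pick the month columns is also an assumption about how those columns are named.

```sql
-- Unpivot every "YYYY-MM" column in one statement
INSERT INTO things_happened_per_month (year, month, zipcode, times)
SELECT substring(kv.key from 1 for 4)::int AS year,
       substring(kv.key from 6 for 2)::int AS month,
       th.zipcode,
       kv.value::int AS times
FROM things_happened th
-- Turn each row into (column name, value as text) pairs
CROSS JOIN LATERAL jsonb_each_text(to_jsonb(th)) AS kv(key, value)
WHERE kv.key ~ '^\d{4}-\d{2}$'   -- keep only the month columns
  AND kv.value IS NOT NULL;      -- skip months without data
```

Like the dynamic function, this picks up month columns added later without any code change, but whether it is actually faster on a large table would have to be measured.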