How to use PL/pgSQL to transform an original table into two desired tables in PostgreSQL?
Suppose we have this table, things_happened, whose schema looks like this:
CREATE TABLE things_happened
(
    zipcode character varying(10),
    city character varying(50),
    state character varying(2),
    metro character varying(50),
    countyname character varying(50),
    "1996-04" integer,
    "1996-05" integer,
    "1996-06" integer,
    "1996-07" integer,
    "1996-08" integer,
    "1996-09" integer,
    ...
    "2014-09" integer,
    "2014-10" integer,
    "2014-11" integer
)
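It looks funny because the data was imported from a CSV file by someone else.
Obviously this table is not efficient: for a given area, the values for many of the months are null. So I want to create two tables from it.
The schemas of the two desired tables are: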
area_info (zipcode, city, state, metro, countyname) with zipcode as primary key
things_happened_per_month (year, month, zipcode, times) with year, month, zipcode as primary key
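Because the table is large, data keeps arriving, and the column names have to become parameters, I would like to know how to do this with "PL/pgSQL - SQL Procedural Language". Or is there any other efficient solution?
Your table things_happened looks like a pivot table and you want to normalize it into a more efficient data structure. You will have to write a PL/pgSQL function to do this.
Since you have many months, and more columns will probably be added for later months, I suggest that you determine the month columns in the table dynamically and then loop over the result. In the example below I assume that you have already copied the area_info data into its own table; I focus here on the "times" column in the things_happened_per_month (thpm) table, which I assume you have already created.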
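The two prerequisites assumed above (the target tables exist and area_info is already filled) could be set up roughly as follows. This is only a sketch: the column types are copied from things_happened, while the integer types for year, month and times, the CHECK constraint and the foreign key are assumptions that are not part of the question.

```sql
-- Target tables as described in the question; types are assumptions.
CREATE TABLE area_info (
    zipcode    character varying(10) PRIMARY KEY,
    city       character varying(50),
    state      character varying(2),
    metro      character varying(50),
    countyname character varying(50)
);

CREATE TABLE things_happened_per_month (
    year    integer NOT NULL,
    month   integer NOT NULL CHECK (month BETWEEN 1 AND 12),
    zipcode character varying(10) NOT NULL REFERENCES area_info (zipcode),
    times   integer NOT NULL,
    PRIMARY KEY (year, month, zipcode)
);

-- Copy the area columns once. DISTINCT ON keeps a single row per zipcode
-- in case the CSV import produced duplicates.
INSERT INTO area_info (zipcode, city, state, metro, countyname)
SELECT DISTINCT ON (zipcode) zipcode, city, state, metro, countyname
FROM things_happened
WHERE zipcode IS NOT NULL
ORDER BY zipcode;
```

If the foreign key is kept, area_info has to be populated before any rows go into things_happened_per_month.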
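Dynamic solution
The function below uses a dynamic lookup of the YYYY-MM columns in the table and then loops over the records and the columns to put the data into the normalized table. (Many thanks to Pavel Stehule for pointing out the last nit-picking error in the code.)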
CREATE FUNCTION normalize_things_happened() RETURNS void AS $$
DECLARE
    col_names text[];
    period    text;
    th        things_happened%rowtype;
    times     integer;
BEGIN
    -- Get the currently present columns from the catalog
    SELECT array_agg(attname::text) INTO col_names
    FROM pg_attribute att
    JOIN pg_class c ON c.oid = att.attrelid
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relname = 'things_happened'
      AND n.nspname = 'public'
      AND position('-' in attname) = 5; -- only "times" columns

    -- Loop over all the rows in the things_happened table
    FOR th IN SELECT * FROM things_happened LOOP
        -- Now loop over column names
        FOREACH period IN ARRAY col_names LOOP
            -- Fudge the proper column from the th record into a local variable;
            -- $1 is the th record passed in through USING
            EXECUTE 'SELECT ($1).' || quote_ident(period) INTO times USING th;
            -- If times is a proper value, insert it into the thpm table
            IF times IS NOT NULL THEN
                INSERT INTO things_happened_per_month (year, month, zipcode, times) VALUES
                    (substring(period from 1 for 4)::int, substring(period from 6 for 2)::int, th.zipcode, times);
            END IF;
        END LOOP;
    END LOOP;
END; $$ LANGUAGE plpgsql;
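This should work as a one-off exercise. If the original table keeps receiving new data, you should run this function periodically and do an UPSERT inside the innermost IF, where the INSERT is now: first try to UPDATE the "times" value, and if that fails because there is no row for the year, month, zipcode combination, do an INSERT. See the many other questions on this site for examples.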
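On PostgreSQL 9.5 or later, the UPDATE-then-INSERT sequence can probably be collapsed into a single INSERT ... ON CONFLICT statement. The sketch below assumes that (year, month, zipcode) is the primary key of things_happened_per_month, as stated in the question; the literal values are placeholders for illustration only.

```sql
-- Upsert one monthly value: insert it, or overwrite the stored value
-- when a row for this (year, month, zipcode) already exists.
-- The values 1996, 4, '00000' and 7 are made-up placeholders.
INSERT INTO things_happened_per_month (year, month, zipcode, times)
VALUES (1996, 4, '00000', 7)
ON CONFLICT (year, month, zipcode)
DO UPDATE SET times = EXCLUDED.times;
```

Inside the function, the VALUES list would carry the same expressions as the existing INSERT (the two substring(period ...) casts, th.zipcode and times) instead of literals. If a foreign key to area_info is in place, the zipcode must exist there first.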
Static solution
The function below is the non-dynamic variant. You have to put a separate block of code in it for every month in the things_happened table.
CREATE FUNCTION normalize_things_happened() RETURNS void AS $$
DECLARE
    th    things_happened%rowtype;
    times integer;
BEGIN
    -- Loop over all the rows in the things_happened table
    FOR th IN SELECT * FROM things_happened LOOP
        -- Copy the below block for 1996, April, for all other months.
        SELECT th."1996-04" INTO times;
        IF times IS NOT NULL THEN
            INSERT INTO things_happened_per_month (year, month, zipcode, times) VALUES (1996, 4, th.zipcode, times);
        END IF;
        -- 1996, May
        SELECT th."1996-05" INTO times;
        IF times IS NOT NULL THEN
            INSERT INTO things_happened_per_month (year, month, zipcode, times) VALUES (1996, 5, th.zipcode, times);
        END IF;
        -- Etc.
    END LOOP;
END; $$ LANGUAGE plpgsql;
Ugly, but functional.
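Since the question also asks for other efficient solutions, here is a set-based sketch that avoids both the row-by-row loop and the per-month blocks. It is not part of the approach above: it assumes PostgreSQL 9.5 or later (for to_jsonb) and relies on every month column being named exactly YYYY-MM. Each row is turned into a JSON object, which is then expanded into one (column name, value) pair per column.

```sql
-- Unpivot all month columns in a single statement.
INSERT INTO things_happened_per_month (year, month, zipcode, times)
SELECT substring(kv.key from 1 for 4)::int,   -- year part of the column name
       substring(kv.key from 6 for 2)::int,   -- month part of the column name
       th.zipcode,
       kv.value::int
FROM things_happened AS th
CROSS JOIN LATERAL jsonb_each_text(to_jsonb(th)) AS kv(key, value)
WHERE kv.key ~ '^[0-9]{4}-[0-9]{2}$'          -- keep only the month columns
  AND kv.value IS NOT NULL;                   -- skip months without data
```

Because new month columns are picked up automatically, this could be rerun after each import; combined with an ON CONFLICT clause it would also tolerate rows that were already copied.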