Get MIN EFF_DT and MAX CANC_DT from data in Pig
Schema:
TYP|ID|RECORD|SEX|EFF_DT|CANC_DT
DMF|1234567|98765432|M|2011-08-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
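Suppose I have multiple records like these. I only want to display the record with the minimum eff_dt and the maximum canc_dt.
In this case, I only want to display this one record: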
DMF|1234567|98765432|M|2011-04-30|9999-12-31
Thanks
Get the minimum eff_dt and the maximum canc_dt, then use them to filter the relation. Assuming you have a relation A:
B = GROUP A ALL;
X = FOREACH B GENERATE MIN(A.EFF_DT);
Y = FOREACH B GENERATE MAX(A.CANC_DT);
C = FILTER A BY ((EFF_DT == X.$0) AND (CANC_DT == Y.$0));
D = DISTINCT C;
DUMP D;
Suppose you have this data (sample shown here):
DMF|1234567|98765432|M|2011-08-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
DMX|1234567|98765432|M|2011-12-30|9999-12-31
DMX|1234567|98765432|M|2011-04-30|9999-12-31
DMX|1234567|98765432|M|2011-04-01|9999-12-31
Follow these steps:
-- 1. Load the data, if you have not already
A = load 'data.txt' using PigStorage('|') as (typ: chararray, id:chararray, record:chararray, sex:chararray, eff_dt:datetime, canc_dt:datetime);
-- 2. Group the data by the attribute of your choice, in this case TYP
grouped = group A by typ;
-- 3. Now, generate MIN/MAX for each group. Also, only keep relevant fields
min_max = foreach grouped generate group, MIN(A.eff_dt) as min_eff_dt, MAX(A.canc_dt) as max_canc_dt;
-- 4. Inspect the result
dump min_max;
(DMF,2011-04-30T00:00:00.000Z,9999-12-31T00:00:00.000Z)
(DMX,2011-04-01T00:00:00.000Z,9999-12-31T00:00:00.000Z)
If needed, convert the datetime values to chararray.
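As a minimal sketch of that conversion, assuming the min_max relation from step 3 and a Pig version that provides the ToString(datetime, format) built-in, the two dates can be formatted back into the input's yyyy-MM-dd shape. The alias min_max_str and the output field names are illustrative, not part of the original answer.

```pig
-- Assumes min_max has the layout (group, min_eff_dt, max_canc_dt) from step 3.
-- $0 refers to the grouping key (typ) by position.
min_max_str = foreach min_max generate
    $0 as typ,
    ToString(min_eff_dt, 'yyyy-MM-dd') as min_eff_dt,    -- datetime -> chararray
    ToString(max_canc_dt, 'yyyy-MM-dd') as max_canc_dt;  -- datetime -> chararray

dump min_max_str;
```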
Note: there are several ways to do this. The approach shown here produces the desired result in two steps beyond the load: GROUP and FOREACH.
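One such alternative, sketched here under assumptions, returns the full matching records rather than only the aggregated dates: join the per-group minimum and maximum back to A, then deduplicate. It reuses the relations A and grouped from the steps above; the aliases mm, joined, matched, and result are hypothetical names introduced for illustration.

```pig
-- Per-group min/max, with the grouping key renamed so it can serve as a join key.
mm = foreach grouped generate
    group as typ,
    MIN(A.eff_dt) as min_eff_dt,
    MAX(A.canc_dt) as max_canc_dt;

-- Keep only rows of A whose dates equal their group's minimum eff_dt and maximum canc_dt.
joined = join A by (typ, eff_dt, canc_dt), mm by (typ, min_eff_dt, max_canc_dt);

-- Project the original columns and drop exact duplicate rows.
matched = foreach joined generate
    A::typ as typ, A::id as id, A::record as record,
    A::sex as sex, A::eff_dt as eff_dt, A::canc_dt as canc_dt;
result = distinct matched;

dump result;
```

Note that a group produces no row when no single record carries both its minimum eff_dt and its maximum canc_dt.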