Take MIN EFF_DT and MAX CANC_DT from data in Pig

Schema:

TYP|ID|RECORD|SEX|EFF_DT|CANC_DT

DMF|1234567|98765432|M|2011-08-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31

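Suppose I have many records like these. I only want to show the record that has the minimum EFF_DT and the maximum CANC_DT.

In other words, I want to show only this one record: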

DMF|1234567|98765432|M|2011-04-30|9999-12-31

Thanks

Get the minimum eff_dt and the maximum canc_dt, then use them to filter the relation. Assuming you have a relation A:

B = GROUP A ALL;
X = FOREACH B GENERATE MIN(A.EFF_DT);
Y = FOREACH B GENERATE MAX(A.CANC_DT);

C = FILTER A BY ((EFF_DT == X.$0) AND (CANC_DT == Y.$0));
D = DISTINCT C;
DUMP D; 

Suppose you have this data (a sample):

DMF|1234567|98765432|M|2011-08-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
DMX|1234567|98765432|M|2011-12-30|9999-12-31
DMX|1234567|98765432|M|2011-04-30|9999-12-31
DMX|1234567|98765432|M|2011-04-01|9999-12-31

Follow these steps:

-- 1. Read the data, if you have not already
A = load 'data.txt' using PigStorage('|') as (typ: chararray, id:chararray, record:chararray, sex:chararray, eff_dt:datetime, canc_dt:datetime);

-- 2. Group the data by the attribute of your choice; in this case it is TYP
grouped = group A by typ;

-- 3. Now, generate MIN/MAX for each group. Also, only keep relevant fields
min_max = foreach grouped generate group, MIN(A.eff_dt) as min_eff_dt, MAX(A.canc_dt) as max_canc_dt;

-- 4. Inspect the result
dump min_max;
(DMF,2011-04-30T00:00:00.000Z,9999-12-31T00:00:00.000Z)
(DMX,2011-04-01T00:00:00.000Z,9999-12-31T00:00:00.000Z)

If needed, convert the datetime values to chararray.
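One way to do that conversion is with Pig's built-in ToString function. This is a minimal sketch that assumes the min_max relation from step 3 and a yyyy-MM-dd output format; the relation name formatted is only illustrative.

```pig
-- Convert the datetime columns of min_max to chararray in yyyy-MM-dd form.
-- "formatted" is a hypothetical relation name chosen for this example.
formatted = foreach min_max generate
    group as typ,
    ToString(min_eff_dt, 'yyyy-MM-dd') as min_eff_dt,
    ToString(max_canc_dt, 'yyyy-MM-dd') as max_canc_dt;

dump formatted;
```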

Note: there are different ways to do this. Apart from the load step, the approach shown here produces the desired result in two steps: GROUP and FOREACH.
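The grouped result keeps only the group key and the two dates. If the full original record is wanted, as in the question, one possible follow-up is to join the per-group bounds back to A. The sketch below assumes the relations A and grouped from the steps above; the names bounds, matched, full_rows and result are made up for illustration.

```pig
-- Recompute the per-group bounds, renaming "group" to typ so it can serve as a join key.
bounds = foreach grouped generate
    group as typ,
    MIN(A.eff_dt) as min_eff_dt,
    MAX(A.canc_dt) as max_canc_dt;

-- Keep only the rows of A whose dates equal the bounds of their own typ.
matched = join A by (typ, eff_dt, canc_dt), bounds by (typ, min_eff_dt, max_canc_dt);

-- Project the original columns and drop exact duplicates.
full_rows = foreach matched generate
    A::typ as typ, A::id as id, A::record as record,
    A::sex as sex, A::eff_dt as eff_dt, A::canc_dt as canc_dt;
result = distinct full_rows;

dump result;
```

A group returns no row from this join when no single record carries both its minimum eff_dt and its maximum canc_dt.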