Time differences in Apache Pig?
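In a big data setting, I have a time series S1 = (t1, t2, t3, ...) sorted in ascending order. I would like to produce a series of time differences: S2 = (t2-t1, t3-t2, ...)
Is there a way to do this in Apache Pig? Short of a very inefficient self-join, I do not see one.
If not, what would be a good approach that is suitable for large amounts of data?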
- S1 = generate Id, Timestamp, i.e. from t1...tn
- S2 = generate Id, Timestamp, i.e. from t2...tn
- S3 = join S1 by Id, S2 by Id
- S4 = extract S1.Timestamp, S2.Timestamp, (S2.Timestamp - S1.Timestamp)
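The four steps above can be sketched outside Pig to make the id shift explicit. This is only an illustration in plain Python, not Pig code: the function name `time_diffs` and the use of dictionaries as stand-ins for the relations S1 and S2 are assumptions made for the example.

```python
def time_diffs(series):
    """Return (t_prev, t_next, t_next - t_prev) for each consecutive pair
    of an ascending series, following the S1..S4 steps."""
    # S1: key t1..t(n-1) by position
    s1 = {i: t for i, t in enumerate(series[:-1])}
    # S2: key t2..tn by the position of its predecessor (the shifted id)
    s2 = {i: t for i, t in enumerate(series[1:])}
    # S3: join on the shared id; S4: project both values and their difference
    return [(s1[i], s2[i], s2[i] - s1[i]) for i in sorted(s1.keys() & s2.keys())]


if __name__ == "__main__":
    print(time_diffs([1, 4, 9, 16]))
```

The key point is that S2 is numbered one position behind S1, so an equality join on Id lines up each element with its successor.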
Edit
Sample data
2014-02-19T01:03:37
2014-02-26T01:03:39
2014-02-28T01:03:45
2014-04-01T01:04:22
2014-05-11T01:06:02
2014-06-30T01:08:56
Script
-- Load the series and prepend a sequential row number 1..n (RANK without BY)
s1 = LOAD 'test2.txt' USING PigStorage() AS (t:chararray);
s11 = FOREACH s1 GENERATE ToDate(t) AS t1;
s1_new = RANK s11;
-- Load the same file under a second alias for the self-join
s2 = LOAD 'test2.txt' USING PigStorage() AS (t:chararray);
s22 = FOREACH s2 GENERATE ToDate(t) AS t1;
s2_new = RANK s22;
-- Drop the first-ranked row and re-rank, so that rank k now holds t(k+1)
ss = FILTER s2_new BY (rank_s22 > 1);
ss_new = RANK ss;
-- Pair t(k) with t(k+1) on the rank and take the difference in days
s3 = JOIN s1_new BY rank_s11, ss_new BY rank_ss;
s4 = FOREACH s3 GENERATE DaysBetween(ss_new::t1, s1_new::t1) AS time_diff;
DUMP s4;
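To sanity-check what the script should produce for the sample data, the same consecutive differences can be computed in plain Python. This is a rough cross-check, not part of the Pig job: the helper name `days_between_consecutive` is made up for the example, and it assumes that DaysBetween counts whole elapsed days, which is what `timedelta.days` gives for these ascending timestamps.

```python
from datetime import datetime

# The six timestamps from the sample data, in ascending order.
SAMPLE = [
    "2014-02-19T01:03:37",
    "2014-02-26T01:03:39",
    "2014-02-28T01:03:45",
    "2014-04-01T01:04:22",
    "2014-05-11T01:06:02",
    "2014-06-30T01:08:56",
]


def days_between_consecutive(stamps):
    """Whole days elapsed between each timestamp and the one before it."""
    times = [datetime.fromisoformat(s) for s in stamps]
    # zip(times, times[1:]) pairs t(k) with t(k+1), like the join on rank
    return [(later - earlier).days for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    print(days_between_consecutive(SAMPLE))  # [7, 2, 32, 40, 50]
```

Note that the join in the Pig script does not guarantee output order, so the dumped rows may appear in a different order than this list.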