Time differences in Apache Pig?

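In a big-data environment, I have a time series S1 = (t1, t2, t3, ...) sorted in ascending order. I would like to produce the series of time differences S2 = (t2-t1, t3-t2, ...).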

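  1. Is there a way to do this in Apache Pig? Short of a very inefficient self-join, I do not see one.

  2. If not, what would be a good way to do this that is suitable for large amounts of data?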

  1. S1 = generate Id, Timestamp, i.e. for t1...tn
  2. S2 = generate Id, Timestamp, i.e. for t2...tn
  3. S3 = join S1 by Id, S2 by Id
  4. S4 = extract S1.Timestamp, S2.Timestamp, (S2.Timestamp - S1.Timestamp)
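To make the four steps concrete, here is a minimal sketch in plain Python rather than Pig, run on the sample timestamps from this post. It only illustrates the shift-and-join logic on an in-memory list; the function name `time_diffs_by_join` and the use of dictionaries keyed by Id are assumptions made for this illustration, not part of any Pig API.

```python
from datetime import datetime

# Sample timestamps from this post, already sorted in ascending order.
timestamps = [
    "2014-02-19T01:03:37",
    "2014-02-26T01:03:39",
    "2014-02-28T01:03:45",
    "2014-04-01T01:04:22",
    "2014-05-11T01:06:02",
    "2014-06-30T01:08:56",
]


def time_diffs_by_join(raw):
    """Hypothetical helper: two numbered copies, one shifted, joined on Id."""
    parsed = [datetime.strptime(t, "%Y-%m-%dT%H:%M:%S") for t in raw]
    # Step 1: S1 = (Id, Timestamp) for t1..tn, with Id starting at 1.
    s1 = {i: t for i, t in enumerate(parsed, start=1)}
    # Step 2: S2 = (Id, Timestamp) for t2..tn, renumbered from 1.
    s2 = {i: t for i, t in enumerate(parsed[1:], start=1)}
    # Step 3: inner join on Id; the last row of S1 has no partner in S2.
    joined = [(s1[i], s2[i]) for i in sorted(s1.keys() & s2.keys())]
    # Step 4: emit both timestamps together with their difference.
    return [(a, b, b - a) for a, b in joined]


diffs = time_diffs_by_join(timestamps)
whole_days = [delta.days for _, _, delta in diffs]
print(whole_days)  # [7, 2, 32, 40, 50]
```

Because the Ids of the second copy are shifted by one, each joined pair holds two consecutive timestamps, so n input rows yield n-1 differences.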

Edit

Sample data

2014-02-19T01:03:37
2014-02-26T01:03:39
2014-02-28T01:03:45
2014-04-01T01:04:22
2014-05-11T01:06:02
2014-06-30T01:08:56

Script

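-- First copy: parse the timestamps and number the rows 1..n (field rank_s11)
s1 = LOAD 'test2.txt' USING PigStorage() AS (t:chararray);
s11 = foreach s1 generate ToDate(t) as t1;
s1_new = rank s11;

-- Second copy of the same file, numbered the same way (field rank_s22)
s2 = LOAD 'test2.txt' USING PigStorage() AS (t:chararray);
s22 = foreach s2 generate ToDate(t) as t1;
s2_new = rank s22;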

-- Drop the rank-1 row and re-rank, so that t2 gets rank 1, t3 gets rank 2, ...
ss = FILTER s2_new by (rank_s22 > 1);
ss_new = rank ss;

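-- Pair each timestamp with its successor via the matching ranks,
-- then compute the gap between the two in days
s3 = join s1_new by rank_s11,ss_new by rank_ss;
s4 = foreach s3 generate DaysBetween(ss_new::t1,s1_new::t1) as time_diff;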

DUMP s4;