Hadoop Pig Ordered Analytical Functions
I am new to Pig and would like to use ordered analytical functions, similar to the ones in SQL.
My data looks like this:
(stock_symbol,date,stock_price_open,stock_price_close)
(TAC,2001-08-06,16.39,16.36)
(TAC,2001-08-07,16.3,16.54)
(TAC,2001-08-08,16.55,16.44)
(TAC,2001-08-09,16.45,16.48)
(TAC,2001-08-10,16.5,15.8)
What I want to do is find the day-over-day change in the opening price. So my output would look like this:
(stock_symbol,date,stock_price_open,stock_price_close,stock_price_change)
(TAC,2001-08-06,16.39,16.36,NULL)
(TAC,2001-08-07,16.3,16.54,-0.09)
(TAC,2001-08-08,16.55,16.44,0.25)
(TAC,2001-08-09,16.45,16.48,-0.1)
(TAC,2001-08-10,16.5,15.8,0.05)
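I want Pig to be able to look at the row before or after the current row. Is this possible, or does Pig not allow this kind of analysis?
You can get the expected output with the following script, though it may need some fine-tuning.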
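-- Load the comma-separated rows; fields are referenced by position.
A = load '/tmp/pig/test/test' using PigStorage(',');
-- Parse the date ('MM' is month; lowercase 'mm' would be read as minutes) and cast the prices.
B = foreach A generate $0 as stock_symbol, ToDate((chararray)$1, 'yyyy-MM-dd') as dt, (double)$2 as stock_price_open, (double)$3 as stock_price_close, 'PT24H' as dthour;
-- dt_old is the previous day's date (dt minus 24 hours).
C = foreach B generate $0 as stock_symbol, $1 as dt_curr, SubtractDuration($1, $4) as dt_old, $2 as stock_price_open, $3 as stock_price_close;
-- Always-true filter that yields a second alias for the self-join (operands are an assumption).
START = FILTER C BY (stock_symbol == stock_symbol);
D = JOIN C by $0, START by $0;
-- Keep pairs where the START row is exactly one day before the C row (field references are an assumption based on the aliases above).
Filter_D = FILTER D by ((DaysBetween(C::dt_curr, START::dt_curr) == 1) and (DaysBetween(C::dt_old, START::dt_curr) == 0));
E = foreach Filter_D generate C::stock_symbol as stock_symbol, C::dt_curr as dt, C::stock_price_open as stock_price_open, C::stock_price_close as stock_price_close, C::stock_price_open - START::stock_price_open as stock_price_change;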
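The output is shown below. The first day (2001-08-06) has no previous row to join with, so it does not appear. The dates show month 01 and minute 08 because this run used the pattern 'yyyy-mm-dd', in which lowercase mm means minutes rather than months: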
(TAC,2001-01-07T00:08:00.000-08:00,16.3,16.54,-0.08999999999999986)
(TAC,2001-01-08T00:08:00.000-08:00,16.55,16.44,0.25)
(TAC,2001-01-09T00:08:00.000-08:00,16.45,16.48,-0.10000000000000142)
(TAC,2001-01-10T00:08:00.000-08:00,16.5,15.8,0.05000000000000071)
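Because we need the date one day before each row's date, the constant 'PT24H' is used; it denotes a duration of 24 hours in Pig.
ToDate() parses the date and SubtractDuration() derives the previous day's date in the next step; the JOIN and DaysBetween() operations then pair each row with the previous day's row so that the difference can be computed.
ToDate(), SubtractDuration(), and DaysBetween() are built-in Pig functions. You can also write a suitable UDF to fine-tune this script and handle the data more appropriately.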
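As one possible shape for such a UDF, here is a minimal sketch in Python (Pig can run Python UDFs through its Jython engine). The function name `open_price_changes`, the schema string, and the rounding to two decimals are all assumptions made for illustration, not part of the script above. It takes the rows of a single symbol, sorts them by date, and appends the difference from the previous row's opening price.

```python
# Fallback so the file also runs outside Pig. Under Pig's Jython engine the
# outputSchema decorator is expected to be provided automatically.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap


@outputSchema("changes:bag{t:(stock_symbol:chararray,date:chararray,"
              "stock_price_open:double,stock_price_close:double,"
              "stock_price_change:double)}")
def open_price_changes(rows):
    """Append the change in opening price versus the previous row.

    rows: iterable of (stock_symbol, date, open, close) tuples for ONE symbol.
    Dates are assumed to be 'yyyy-MM-dd' strings, so sorting them as text
    also sorts them chronologically. The first row gets None (NULL in Pig).
    """
    ordered = sorted(rows, key=lambda r: r[1])
    result = []
    previous_open = None
    for symbol, date, open_price, close_price in ordered:
        if previous_open is None:
            change = None
        else:
            # Rounding (an assumption) hides floating-point noise such as
            # -0.08999999999999986.
            change = round(open_price - previous_open, 2)
        result.append((symbol, date, open_price, close_price, change))
        previous_open = open_price
    return result
```

Assuming the file is saved under a hypothetical name such as `price_change.py` and the relation is loaded with a schema so that prices are doubles, it could be registered with `REGISTER 'price_change.py' USING jython AS pc;`, the data grouped by `stock_symbol`, and the function applied to each group's bag inside `FLATTEN(...)`. Unlike the join above, this sketch compares each row with the previous row rather than the previous calendar day, so gaps such as weekends are bridged, and the first row is kept with a NULL change.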