Apache Pig script with time interval
I want to sum the RW column for each port, per hour.
Time ID Name RW
-------- --- ------- ----------
14:57:01 000 Port0 1340
14:57:01 001 Port1 13
14:58:01 000 Port0 864
14:58:01 001 Port1 36
14:59:01 000 Port0 1394
14:59:01 001 Port1 22
15:57:01 000 Port0 1340
15:57:01 001 Port1 13
15:58:01 000 Port0 864
15:58:01 001 Port1 36
15:59:01 000 Port0 1394
15:59:01 001 Port1 22
.
.
.
20:57:01 000 Port0 1340
20:57:01 001 Port1 13
20:58:01 000 Port0 864
20:58:01 001 Port1 36
20:59:01 000 Port0 1394
20:59:01 001 Port1 22
My script is:
data = LOAD 'hdfs:/data/data.txt' USING PigStorage(',') AS (time:chararray, id:chararray, name:chararray, read:int, write:int, rw:int);
runs = FOREACH data GENERATE time, name, rw;
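How can I do this?
You have to generate a new column, hour, from the time column, then group by hour and port name, and then take the sum of rw for each group.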
data = LOAD 'hdfs:/data/data.txt' USING PigStorage(',') AS (time:chararray, id:chararray, name:chararray, read:int, write:int, rw:int);
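-- Parse the HH:mm:ss string into a datetime so that GetHour can extract the hour.
runs = FOREACH data GENERATE GetHour(ToDate(time, 'HH:mm:ss')) AS hour, name, rw;
grouped = GROUP runs BY (hour, name);
-- After GROUP, the bag is named after the grouped relation (runs), not data.
port_total = FOREACH grouped GENERATE FLATTEN(group) AS (hour, name), SUM(runs.rw) AS total_rw;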
DUMP port_total;
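If the time column is always a zero-padded HH:mm:ss string, as in the sample, the hour can also be taken directly from the first two characters, without any date parsing. The following is only a sketch under that assumption; the alias total_rw is a name introduced here for illustration, and the LOAD statement is copied unchanged from the question. Note that the bag produced by GROUP is named after the grouped relation, so the sum is taken over runs.rw.

```pig
-- Assumption: time is always zero-padded HH:mm:ss, so characters 0-1 are the hour.
data = LOAD 'hdfs:/data/data.txt' USING PigStorage(',') AS (time:chararray, id:chararray, name:chararray, read:int, write:int, rw:int);

-- SUBSTRING(string, startIndex, stopIndex) keeps the hour as a chararray, e.g. '14'.
runs = FOREACH data GENERATE SUBSTRING(time, 0, 2) AS hour, name, rw;

-- One group per (hour, port name) pair.
grouped = GROUP runs BY (hour, name);

-- Sum rw over the bag of rows collected for each group.
port_total = FOREACH grouped GENERATE FLATTEN(group) AS (hour, name), SUM(runs.rw) AS total_rw;
DUMP port_total;
```

For the sample rows in hour 14, Port0 should total 1340 + 864 + 1394 = 3598 and Port1 should total 13 + 36 + 22 = 71, provided the rw field is parsed as shown in the table.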