How to generate a distinct average of lots of data in Pig Latin?
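I have a large dataset of rental listings, and I would like to generate the average price for each city based on the number of bedrooms. I have rows of the following type: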
{( city: 'New York', num_bedrooms: 1, price: 1000.00 ),
( city: 'New York', num_bedrooms: 2, price: 2000.00 ),
( city: 'New York', num_bedrooms: 1, price: 2000.00 ),
( city: 'Chicago', num_bedrooms: 1, price: 4000.00 ),
( city: 'Chicago', num_bedrooms: 1, price: 1500.00 )}
Using Pig, I would like to get results in the following format:
{( city: 'New York', 1: 1500.00, 2: 2000.00),
( city: 'Chicago', 1: 2750.00 )}
Alternatively, I could also work with this:
{( city: 'New York', num_bedrooms: 1, price: 1500.00),
( city: 'New York', num_bedrooms: 2, price: 2000.00),
( city: 'Chicago', num_bedrooms: 1, price: 2750.00 )}
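My plan is to use this data to create bar charts, with the number of bedrooms on the X axis and the price on the Y axis for a given city. I have been able to group by city and number of bedrooms and then take the average, but I don't know how to get the data into the format I want. This is what I have so far: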
D = GROUP blah BY (city, num_bedrooms);
C = FOREACH D GENERATE blah.city, blah.num_bedrooms, AVG(blah.price);
However, this causes city and num_bedrooms to be repeated once for every row in each group!
Input:
New York,1,1000.00
New York,2,2000.00
New York,1,2000.00
Chicago,1,4000.00
Chicago,1,1500.00
Approach 1:
Pig script:
rental_data = LOAD 'rental_data.csv' USING PigStorage(',') AS (city:chararray, num_bedrooms: long, price:double);
rental_data_grp_city = GROUP rental_data BY (city);
rental_kpi = FOREACH rental_data_grp_city {
    one_bed_room = FILTER rental_data BY num_bedrooms==1;
    two_bed_room = FILTER rental_data BY num_bedrooms==2;
    GENERATE group AS city, AVG(one_bed_room.price) AS one_bed_price, AVG(two_bed_room.price) AS two_bed_price;
};
Output of DUMP rental_kpi:
(Chicago,2750.0,)
(New York,1500.0,2000.0)
Approach 2:
Pig script:
rental_data = LOAD 'rental_data.csv' USING PigStorage(',') AS (city:chararray, num_bedrooms: long, price:double);
rental_data_grp_city = GROUP rental_data BY (city,num_bedrooms);
rental_kpi = FOREACH rental_data_grp_city {
    prices_bag = rental_data.price;
    GENERATE group.city AS city, group.num_bedrooms AS num_bedrooms, AVG(prices_bag) AS price;
};
Output of DUMP rental_kpi:
(Chicago,1,2750.0)
(New York,2,2000.0)
(New York,1,1500.0)
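As for why the attempt in the question repeats city and num_bedrooms: after a GROUP, blah.city is a bag holding the city of every row in that group, while the grouping key itself lives in the group tuple. Below is a minimal sketch of the same fix applied to the question's own aliases, assuming the relation blah has the fields city, num_bedrooms and price. It should yield the same rows as approach 2.

```pig
-- Assumption: blah has the schema (city:chararray, num_bedrooms:long, price:double).
D = GROUP blah BY (city, num_bedrooms);

-- FLATTEN(group) expands the (city, num_bedrooms) key tuple into two top-level
-- fields, so each key appears once per group instead of once per input row.
C = FOREACH D GENERATE FLATTEN(group) AS (city, num_bedrooms), AVG(blah.price) AS price;

DUMP C;
```

Projecting group.city and group.num_bedrooms, as approach 2 does, is equivalent. FLATTEN is only a shorter way to write it when every key field is wanted.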