Apache pig 按功能分组没有给出预期的输出
Apache pig group by function is not giving expected output
我有 csv
格式的数据,如下所示。
数据格式如下
"first_name","last_name","company_name","address","city","county","postal","phone1","phone2","email","web"
User.csv
下命名的示例数据。该文件包含以下数据。
"Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14, Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz@hotmail.com","http://www.alandrosenburgcpapc.co.uk"
"Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas@gmail.com","http://www.capgeminiamerica.co.uk"
"France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade@hotmail.com","http://www.elliottjohnwesq.co.uk"
当我尝试使用 PigStorage
加载时
user = LOAD '/home/abhijit/Downloads/User.csv' USING PigStorage(',');
DUMP user;
它的输出是这样的:
("Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14 Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz@hotmail.com","http://www.alandrosenburgcpapc.co.uk")
("Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas@gmail.com","http://www.capgeminiamerica.co.uk")
("France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade@hotmail.com","http://www.elliottjohnwesq.co.uk")
我想做一个城市的群。所以我写了
grp = group user by ;
dump grp;
我得到的输出为:
( Binney St",{("Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas@gmail.com","http://www.capgeminiamerica.co.uk")})
("8 Moor Place",{("France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade@hotmail.com","http://www.elliottjohnwesq.co.uk")})
("St. Stephens Ward",{("Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14 Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz@hotmail.com","http://www.alandrosenburgcpapc.co.uk")})
company_name 和地址造成了问题,因为它包含 ','
作为其中的一部分。例如地址中的 "14, Taylor St"
或 company_name.
中的 "Elliott, John W Esq"
所以我的 </code> 被视为 <code>"Taylor St"
而不是 "St. Stephens Ward"
因此,由于地址数据中的额外分隔符或 company_name 数据未正确加载或未正确分隔,并且按功能分组未给出正确的结果。
如何实现分组输出如下
("Abbey Ward",{("Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas@gmail.com","http://www.capgeminiamerica.co.uk")})
("St. Stephens Ward",{("Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14, Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz@hotmail.com","http://www.alandrosenburgcpapc.co.uk")})
("East Southbourne and Tuckton W",{("France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade@hotmail.com","http://www.elliottjohnwesq.co.uk")})
grp = group a by ;
这不是我的解决方案。我已经想到了。
问题是 PigStorage
没有考虑转义,因此为不应该是列的字段创建列(每次条目包含逗号)。
使用 CSVExcelStorage
将解决此问题,因为此存储可以处理转义,从而创建正确数量和顺序的列。
我有 csv
格式的数据,如下所示。
数据格式如下
"first_name","last_name","company_name","address","city","county","postal","phone1","phone2","email","web"
User.csv
下命名的示例数据。该文件包含以下数据。
"Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14, Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz@hotmail.com","http://www.alandrosenburgcpapc.co.uk"
"Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas@gmail.com","http://www.capgeminiamerica.co.uk"
"France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade@hotmail.com","http://www.elliottjohnwesq.co.uk"
当我尝试使用 PigStorage
user = LOAD '/home/abhijit/Downloads/User.csv' USING PigStorage(',');
DUMP user;
它的输出是这样的:
("Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14 Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz@hotmail.com","http://www.alandrosenburgcpapc.co.uk")
("Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas@gmail.com","http://www.capgeminiamerica.co.uk")
("France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade@hotmail.com","http://www.elliottjohnwesq.co.uk")
我想做一个城市的群。所以我写了
grp = group user by ;
dump grp;
我得到的输出为:
( Binney St",{("Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas@gmail.com","http://www.capgeminiamerica.co.uk")})
("8 Moor Place",{("France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade@hotmail.com","http://www.elliottjohnwesq.co.uk")})
("St. Stephens Ward",{("Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14 Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz@hotmail.com","http://www.alandrosenburgcpapc.co.uk")})
company_name 和地址造成了问题,因为它包含 ','
作为其中的一部分。例如地址中的 "14, Taylor St"
或 company_name.
"Elliott, John W Esq"
所以我的 </code> 被视为 <code>"Taylor St"
而不是 "St. Stephens Ward"
因此,由于地址数据中的额外分隔符或 company_name 数据未正确加载或未正确分隔,并且按功能分组未给出正确的结果。
如何实现分组输出如下
("Abbey Ward",{("Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas@gmail.com","http://www.capgeminiamerica.co.uk")})
("St. Stephens Ward",{("Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14, Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz@hotmail.com","http://www.alandrosenburgcpapc.co.uk")})
("East Southbourne and Tuckton W",{("France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade@hotmail.com","http://www.elliottjohnwesq.co.uk")})
grp = group a by ;
这不是我的解决方案。我已经想到了。
问题是 PigStorage
没有考虑转义,因此为不应该是列的字段创建列(每次条目包含逗号)。
使用 CSVExcelStorage
将解决此问题,因为此存储可以处理转义,从而创建正确数量和顺序的列。