How to remove double quotes when loading csv into external table in impala?
Here is the data (it can also be downloaded from here):
"Creation Date","Status","First 3 Chars of Postal Code","Intersection Street 1","Intersection Street 2","Ward","Service Request Type","Division","Section"
"2010-01-01 00:38:26.0000000","Closed","Intersection","High Park Blvd","Parkside Dr","Parkdale-High Park (13)","Road - Sanding / Salting Required","Transportation Services","Road Operations"
"2010-01-01 01:19:18.0000000","Closed","M4T","","","Toronto Centre-Rosedale (27)","Water Service Line-Turn On","Toronto Water","District Ops"
Here is my table creation query:
CREATE TABLE sr.sr2013 (
creation_date STRING,
status STRING,
first_3_chars_of_postal_code STRING,
intersection_street_1 STRING,
intersection_street_2 STRING,
ward STRING,
service_request_type STRING,
division STRING,
section STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
WITH SERDEPROPERTIES (
'colelction.delim'='\u0002',
'mapkey.delim'='\u0003',
'serialization.format'=',',
'field.delim'=',',
'skip.header.line.count'='1',
'quoteChar'= "\"") ;
Here is the query that loads the data:
load data inpath '/user/rxie/SR2013.csv' into table sr2013;
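After loading the data, I checked the table and found that all the original quotes were retained.
So there are at least two problems here:
1. The option 'skip.header.line.count'='1' in the table creation does not exclude the header row;
2. The double quotes are not removed when the data is loaded into the table, despite the option 'quoteChar'= "\"".
Can anyone shed more light on this? It looks like a bug to me.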
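To make the symptom concrete, here is a minimal Python sketch (not Impala or Hive code). It contrasts a delimiter-only split, which is roughly how I assume a ROW FORMAT DELIMITED text table treats each line, with a CSV-aware parse that honors a quote character. The function names are made up for illustration; the two lines are copied from the sample data above.

```python
import csv
import io

# Header and one data row copied from the sample file above.
HEADER = ('"Creation Date","Status","First 3 Chars of Postal Code",'
          '"Intersection Street 1","Intersection Street 2","Ward",'
          '"Service Request Type","Division","Section"')
ROW = ('"2010-01-01 01:19:18.0000000","Closed","M4T","","",'
       '"Toronto Centre-Rosedale (27)","Water Service Line-Turn On",'
       '"Toronto Water","District Ops"')


def split_on_delimiter_only(line):
    """Hypothetical helper: split on ',' and nothing else, so quotes stay."""
    return line.split(",")


def parse_as_csv(text):
    """Hypothetical helper: parse with a quote character, so quotes are removed."""
    return list(csv.reader(io.StringIO(text), delimiter=",", quotechar='"'))


naive_fields = split_on_delimiter_only(ROW)
csv_rows = parse_as_csv(HEADER + "\n" + ROW + "\n")

print(naive_fields[1])   # status field, quotes still attached
print(csv_rows[1][1])    # status field, quotes removed
```

The first print keeps the surrounding quotes and the second does not, which matches what I see in the table: the data is being split on the delimiter only, and 'quoteChar' appears to have no effect.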
Update 1:
In the Hue/Hive editor:
creation_date STRING,
status STRING,
first_3_chars_of_postal_code STRING,
intersection_street_1 STRING,
intersection_street_2 STRING,
ward STRING,
service_request_type STRING,
division STRING,
section STRING )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'colelction.delim'='\u0002',
'field.delim'=',',
'mapkey.delim'='\u0003',
'serialization.format'=',',
'skip.header.line.count'='1',
'quoteChar'= "\"")
LOAD DATA LOCAL INPATH '/home/rxie/data/csv/SR2015.csv' INTO TABLE sr2015;
Error:
Error while compiling statement: FAILED: SemanticException line 1:26
Invalid path ''/home/rxie/data/csv/SR2015.csv'': No files matching
path file:/home/rxie/data/csv/SR2015.csv
Here is how I got the quotes excluded when loading the csv.
In the Hive Editor (I think beeline would work as well, although I have not tested it):
Create the Hive table:
CREATE EXTERNAL TABLE sr2015(
creation_date STRING,
status STRING,
first_3_chars_of_postal_code STRING,
intersection_street_1 STRING,
intersection_street_2 STRING,
ward STRING,
service_request_type STRING,
division STRING,
section STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'colelction.delim'='\u0002',
'field.delim'=',',
'mapkey.delim'='\u0003',
'serialization.format'=',',
'skip.header.line.count'='1',
'quoteChar'= "\"")
Load the data into the Hive table:
LOAD DATA INPATH "hdfs:///user/rxie/SR2015.csv" INTO TABLE sr2015;
Open issue (to be discussed here):
The table cannot be accessed in Impala.
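As far as I understand, Impala does not read tables defined with a custom Hive SerDe such as OpenCSVSerde, which would explain the open issue. One possible workaround, which I have not verified on the cluster, is to strip the quotes and the header before the file reaches HDFS, so that a plain delimited text table is enough. The sketch below is only an illustration under that assumption: the function name is hypothetical, and it writes tab-separated output, so the target table would need FIELDS TERMINATED BY '\t' instead of ','.

```python
import csv
import io


def strip_quotes_and_header(src, dst, out_delimiter="\t"):
    """Hypothetical helper: copy CSV rows from src to dst without quotes or header.

    src and dst are text file objects. Returns the number of data rows written.
    """
    reader = csv.reader(src, delimiter=",", quotechar='"')
    next(reader, None)  # skip the header row
    written = 0
    for fields in reader:
        # Keep each record on one line and keep the output delimiter unambiguous.
        cleaned = [f.replace(out_delimiter, " ").replace("\n", " ") for f in fields]
        dst.write(out_delimiter.join(cleaned) + "\n")
        written += 1
    return written


# Small in-memory demo using a shortened header and one row from the sample data.
sample = io.StringIO(
    '"Creation Date","Status","Ward"\n'
    '"2010-01-01 01:19:18.0000000","Closed","Toronto Centre-Rosedale (27)"\n'
)
output = io.StringIO()
rows_written = strip_quotes_and_header(sample, output)
print(rows_written)
print(output.getvalue(), end="")
```

For a real file, open the input with open(path, newline="", encoding="utf-8") so the csv module handles line endings itself, then upload the cleaned file to HDFS and load it as before.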