Split a string based on space in Hive
This is the format of my CSV file:
Chevrolet C10,13.0,8,350.0,145.0,4055,12.0,76,US
Ford F108,13.0,8,302.0,130.0,3870,15.0,76,US
Dodge D100,13.0,8,318.0,150.0,3755,14.0,76,US
Honda Accord CVCC,31.5,4,98.00,68.00,2045,18.5,77,Japan
Buick Opel Isuzu Deluxe,30.0,4,111.0,80.00,2155,14.8,77,US
Renault 5 GTL,36.0,4,79.00,58.00,1825,18.6,77,Europe
Plymouth Arrow GS,25.5,4,122.0,96.00,2300,15.5,77,US
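I want to split the first field like this:
Chevrolet C10 should become Chevrolet
Ford F108 should become Ford
Honda Accord CVCC should become Honda, etc. I will then use the car name for further processing.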
select
case when MODEL like 'US % %' or MODEL like 'Europe % %'
then regexp_extract(MODEL, '^([^ ]* [^ ]*) ', 1)
when MODEL like '% %'
then regexp_extract(MODEL, '^([^ ]*) ', 1)
else MODEL
end as BRAND
from WHATEVER
- Chevrolet C10 => Chevrolet
- US Honda Accord => US Honda
- Zorglub => Zorglub
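To try the CASE expression above without a real table, the three sample values can be fed in as inline rows. This is only a sketch: the subquery alias t and the literal rows are stand-ins for WHATEVER, and it assumes a Hive version that allows SELECT without a FROM clause (0.13 or later).

```sql
-- Inline rows stand in for the real table (hypothetical test data).
SELECT MODEL,
       CASE WHEN MODEL LIKE 'US % %' OR MODEL LIKE 'Europe % %'
            THEN regexp_extract(MODEL, '^([^ ]* [^ ]*) ', 1)  -- keep the first two words
            WHEN MODEL LIKE '% %'
            THEN regexp_extract(MODEL, '^([^ ]*) ', 1)        -- keep the first word
            ELSE MODEL                                        -- no space: keep as-is
       END AS BRAND
FROM (
  SELECT 'Chevrolet C10' AS MODEL
  UNION ALL SELECT 'US Honda Accord'
  UNION ALL SELECT 'Zorglub'
) t;
```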
Use the UDF below:
substring_index(string A, string delim, int count)
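As a minimal sketch of how this function applies to the question, a count of 1 keeps everything before the first space. The string literals are taken from the sample data, and the function is assumed to be available (it was added in Hive 1.3.0).

```sql
-- count = 1: return the part of the string before the first occurrence of the delimiter
SELECT substring_index('Honda Accord CVCC', ' ', 1);  -- Honda
-- when the delimiter does not occur, the whole string comes back
SELECT substring_index('Zorglub', ' ', 1);            -- Zorglub
```

Against a table, the same call would take the column name in place of the literal.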
A solution in Pig
Code:
-- The schema lists all nine CSV columns, with types matching the sample data.
read = LOAD 'test.data' USING PigStorage(',') AS (name:chararray, val1:double, val2:int, val3:double, val4:double, val5:int, val6:double, val7:int, country:chararray);
sub_data = FOREACH read GENERATE SUBSTRING(name,0,(INDEXOF(name, ' ',0))) AS (subname:chararray);
DUMP sub_data;
Output:
(Chevrolet)
(Ford)
(Dodge)
(Honda)
(Buick)
(Renault)
(Plymouth)
Create a table with the schema you want.
CREATE TABLE carinfo (carname STRING, val1 DOUBLE, val2 INT, val3 DOUBLE, val4 DOUBLE, val5 INT, val6 DOUBLE, val7 INT, country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
Load the data into the table above.
LOAD DATA LOCAL INPATH '/hivesamples/splitstr.txt' OVERWRITE INTO TABLE carinfo;
Use CTAS (CREATE TABLE AS SELECT).
Split carname to get the brand name. The new table will have the same schema you defined earlier.
CREATE TABLE modified_carinfo
AS
SELECT split(carname, ' ')[0] AS carname, val1, val2, val3, val4, val5, val6, val7, country
FROM carinfo;
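The split call used above can be checked on literals before running the CTAS. One case to watch, shown here as a sketch with made-up input: a value with a leading space yields an empty first element, so wrapping the column in trim() may be worthwhile if the data is not clean.

```sql
-- split returns an array; [0] is the first element
SELECT split('Honda Accord CVCC', ' ')[0];   -- Honda
SELECT split('Zorglub', ' ')[0];             -- Zorglub (no space: the single element is the whole string)
-- a leading space produces an empty first element, so trim first (hypothetical dirty input)
SELECT split(trim(' Ford F108'), ' ')[0];    -- Ford
```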