Split a string based on space in Hive

This is the format of my CSV file:

Chevrolet C10,13.0,8,350.0,145.0,4055,12.0,76,US
Ford F108,13.0,8,302.0,130.0,3870,15.0,76,US
Dodge D100,13.0,8,318.0,150.0,3755,14.0,76,US
Honda Accord CVCC,31.5,4,98.00,68.00,2045,18.5,77,Japan
Buick Opel Isuzu Deluxe,30.0,4,111.0,80.00,2155,14.8,77,US
Renault 5 GTL,36.0,4,79.00,58.00,1825,18.6,77,Europe
Plymouth Arrow GS,25.5,4,122.0,96.00,2300,15.5,77,US

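I want to split the first field like this: Chevrolet C10 should become Chevrolet, Ford F108 should become Ford, Honda Accord CVCC should become Honda, and so on. I will then use the car name for further processing.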

select
  case when MODEL like 'US % %' or MODEL like 'Europe % %'
        then regexp_extract(MODEL, '^([^ ]* [^ ]*) ', 1)
        when MODEL like '% %'
        then regexp_extract(MODEL, '^([^ ]*) ', 1)
        else MODEL
  end as BRAND
from WHATEVER
  • Chevrolet C10 => Chevrolet
  • US Honda Accord => US Honda
  • Zorglub => Zorglub
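
To try the CASE expression above without creating a table, a minimal sketch is to feed it a few literal values through an inline subquery. The alias t and the sample strings are only illustrative, and this assumes a Hive version that allows SELECT without FROM (0.13 or later).

```sql
-- Sketch only: the three sample strings mirror the examples listed above.
-- Expected BRAND values: Chevrolet, US Honda, Zorglub.
select MODEL,
  case when MODEL like 'US % %' or MODEL like 'Europe % %'
        then regexp_extract(MODEL, '^([^ ]* [^ ]*) ', 1)
        when MODEL like '% %'
        then regexp_extract(MODEL, '^([^ ]*) ', 1)
        else MODEL
  end as BRAND
from (
  select 'Chevrolet C10' as MODEL
  union all select 'US Honda Accord'
  union all select 'Zorglub'
) t;
```

The first branch keeps two words when the value starts with a region prefix, the second keeps one word, and the else branch passes through values that contain no space.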

Use the UDF below:

substring_index(string A, string delim, int count)
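
As a rough illustration of how this could be applied to the question, a count of 1 with a space delimiter returns everything before the first space. The literals are sample values from the question, and the function is assumed to be available in your Hive build (it was added in 1.3.0, as far as I know).

```sql
-- Sketch only: count = 1 keeps the text before the first space.
-- Expected: brand = Honda, no_space = Zorglub (no delimiter, so the whole string is returned).
select substring_index('Honda Accord CVCC', ' ', 1) as brand,
       substring_index('Zorglub', ' ', 1) as no_space;
```

Against a real table, the literal would be replaced by the column that holds the car name.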


Solution in Pig

Code:

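-- The sample rows have nine comma-separated fields, so the schema declares nine, with types matching the data.
read = LOAD 'test.data' USING PigStorage(',') AS (name:chararray, val1:double, val2:int, val3:double, val4:double, val5:int, val6:double, val7:int, country:chararray);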
sub_data = FOREACH read GENERATE SUBSTRING(name,0,(INDEXOF(name, ' ',0)))  AS (subname:chararray);
DUMP sub_data;

Output:

(Chevrolet)
(Ford)
(Dodge)
(Honda)
(Buick)
(Renault)
(Plymouth)

Create a table with the schema you want.

CREATE TABLE carinfo (carname STRING, val1 DOUBLE, val2 INT, val3 DOUBLE, val4 DOUBLE, val5 INT, val6 DOUBLE, val7 INT, country STRING) 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',';

Load the data into the table above.

LOAD DATA LOCAL INPATH '/hivesamples/splitstr.txt' OVERWRITE INTO TABLE carinfo;

Use CTAS to split carname and get the brand name. The new table will have the same schema you defined earlier.

CREATE TABLE modified_carinfo 
AS 
SELECT split(carname, ' ')[0] AS carname, val1, val2, val3, val4, val5, val6, val7, country 
FROM carinfo;
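
Before running the CTAS, the split expression can be checked on literals. This is only a sketch using sample values; split takes a regular expression as its second argument and returns an array, and [0] picks the first element.

```sql
-- Sketch only: the first array element is the text before the first space.
-- Expected: brand = Buick, no_space = Zorglub (no space, so the array has a single element).
SELECT split('Buick Opel Isuzu Deluxe', ' ')[0] AS brand,
       split('Zorglub', ' ')[0] AS no_space;
```

Because the delimiter is a regular expression, a pattern such as '\\s+' could be used instead if the names may contain tabs or repeated spaces; that is an assumption about the data, not something the sample file shows.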