How to list tuples that contain a specific string?
Input:
Visit ID    Events
101         154,2,135
124         1, 120, 1050,2302
139         200, 150, 1, 320
140         30023, 200
I'm new to Pig. How can I use a Pig script to list the rows, with their visit ID, whose events contain "1"?
Thanks!
The code I tried:
a = LOAD '/user/a6000518-a/AdobeHourlySampleHit/hit_data.tsv' using PigStorage('\t');
b= foreach a GENERATE REGEX_EXTRACT_ALL(, '(.*,1,.*|1,.*|.*,1)') as post_event_list;
c= FILTER b BY $0 is not NULL;
d= DISTINCT c;
dump d;
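This prints only the events column for rows that contain "1". If I also generate the visit ID, I get incorrect results. I want to print the visit ID together with the events that contain "1".
I figured it out. It may not be the most efficient way to code it, but it produces the correct output.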
a = LOAD '/user/a6000518-a/AdobeHourlySampleHit/hit_data.tsv' using PigStorage('\t');
b= FILTER a BY $3 == '0';
c= FILTER b BY $8 != '5' AND $8 != '8' AND $8 != '7' AND $8 != '9';
d= FOREACH c GENERATE CONCAT(CONCAT(CONCAT($0,$1),$2),$8) as (visitID:bytearray), $4 as post_event_list, $6 as post_product_list;
-- AND rather than OR: with OR, a blank value still passes because it differs from one of the two literals
e= FILTER d BY post_event_list != ' ' AND post_event_list != '' AND post_event_list is not NULL;
f= FOREACH e GENERATE REGEX_EXTRACT_ALL(post_event_list, '(.*,1,.*|1,.*|.*,1)') as purchase_event, visitID, post_product_list;
g= FILTER f BY $0 is not NULL;
DUMP g;
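The pattern used above can be checked outside Pig against the sample rows. The following is a minimal sketch in plain Python, assuming that REGEX_EXTRACT_ALL returns a result only when the whole string matches (modelled here with re.fullmatch). The names ORIGINAL, TOLERANT, ROWS and matching_ids are hypothetical, and the whitespace-tolerant pattern is an illustration, not part of the original post.

```python
import re

# Pattern copied from the Pig script above.
ORIGINAL = re.compile(r"(.*,1,.*|1,.*|.*,1)")
# Hypothetical variant that allows spaces around the "1" entry
# and also accepts a list consisting of "1" alone.
TOLERANT = re.compile(r"(?:.*,)?\s*1\s*(?:,.*)?")

# Sample rows from the question, with the original spacing preserved.
ROWS = [
    ("101", "154,2,135"),
    ("124", "1, 120, 1050,2302"),
    ("139", "200, 150, 1, 320"),
    ("140", "30023, 200"),
]


def matching_ids(pattern, rows):
    """Return the visit IDs whose event string matches the pattern in full."""
    return [visit_id for visit_id, events in rows if pattern.fullmatch(events)]


if __name__ == "__main__":
    print("original:", matching_ids(ORIGINAL, ROWS))
    print("tolerant:", matching_ids(TOLERANT, ROWS))
```

With the spacing shown in the sample input, the original pattern picks up visit 124 but not 139, because the entry there is written " 1" with a leading space. It also would not match an event list that is just "1". If the real data has no spaces after the commas, the first limitation does not apply.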
You could just write a Python UDF
and check whether the value is present; that might make things simpler.
Python UDF:
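#!/usr/bin/python
# outputSchema is made available by Pig when the file is registered USING jython.
@outputSchema("flg:int")
def tuple_contains(tup, val):
    # Return 1 if val is an element of tup, otherwise 0.
    try:
        if val in tup:
            return 1
        else:
            return 0
    except Exception:
        # For example, tup is null because event_list was null.
        return 0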
Script:
REGISTER /path/to/jars/tuple_contains.py USING jython AS udf;
data = LOAD 'data' AS (visit_id:chararray, event_list:chararray);
A = FILTER data BY udf.tuple_contains(STRSPLIT(event_list, ','), '1') == 1;
B = FOREACH A GENERATE visit_id, event_list; -- ... other columns
DUMP B;
Output:
124 1,120,1050,2302
139 200,150,1,320
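The UDF's membership check can also be tried locally before running it through Pig. This is a rough sketch in plain Python, not the Jython UDF itself: split_events is a hypothetical stand-in for STRSPLIT(event_list, ',') that assumes a simple comma split, and tuple_contains_trimmed is an illustrative variant that is not in the answer above.

```python
def tuple_contains(tup, val):
    """Exact membership test, mirroring the UDF body (no Pig decorator)."""
    try:
        return 1 if val in tup else 0
    except TypeError:
        # tup is None, e.g. a null event_list
        return 0


def tuple_contains_trimmed(tup, val):
    """Hypothetical variant: ignore whitespace around each element."""
    try:
        return 1 if any(item.strip() == val for item in tup) else 0
    except (TypeError, AttributeError):
        # tup is None or holds non-string elements
        return 0


def split_events(event_list):
    """Stand-in for STRSPLIT(event_list, ','): plain split, no trimming."""
    return tuple(event_list.split(","))


# Sample rows from the question, with the original spacing preserved.
ROWS = [
    ("101", "154,2,135"),
    ("124", "1, 120, 1050,2302"),
    ("139", "200, 150, 1, 320"),
    ("140", "30023, 200"),
]


def matching_ids(check, rows, val="1"):
    """Return the visit IDs for which check(...) reports a match."""
    return [vid for vid, events in rows if check(split_events(events), val) == 1]


if __name__ == "__main__":
    print("exact:  ", matching_ids(tuple_contains, ROWS))
    print("trimmed:", matching_ids(tuple_contains_trimmed, ROWS))
```

The output shown above has no spaces after the commas, in which case the exact check is enough. If the data does contain spaces, as in the question's sample, the split leaves elements such as " 1", and only the trimmed variant would treat them as a match.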