Pig REGEX_EXTRACT_ALL is not working
I have the following data.
• PRT_Edit & Set Shopping Cart in Retail
• PRT_Confirm Shopping Cart for Goods
o PRT-Ret_Process Supplier Invoice
o PRT-Web_Overview of Orders
o PRT_Update Outfirst Agreement
PRT_Axn_-Purchase and Requisition
The data contains special symbols, tabs, and one or more spaces. I want to extract only the text part from this data:
PRT_Edit & Set Shopping Cart in Retail
PRT_Confirm Shopping Cart for Goods
PRT-Ret_Process Supplier Invoice
PRT-Web_Overview of Orders
PRT_Update Outfirst Agreement
I tried using REGEX_EXTRACT_ALL in a Pig script as shown below, but it does not work.
PRT = LOAD '/DATA' USING TEXTLOADER() AS (LINE:CHARARRAY);
Cleansed = FOREACH PRT GENERATE REGEX_EXTRACT_ALL(LINE,'[A-Z]*') AS DATA;
When I try to dump Cleansed, it shows no data. Can anyone help?
You can use
Cleansed = FOREACH PRT GENERATE FLATTEN(
REGEX_EXTRACT_ALL(LINE, '^[^a-zA-Z]*([a-zA-Z].*[a-zA-Z])[^a-zA-Z]*$'))
AS (FIELD1:chararray), LINE;
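The regex matches the following:
^ - start of string
[^a-zA-Z]* - 0 or more characters other than Latin letters (a negated character class)
([a-zA-Z].*[a-zA-Z]) - a capturing group, referred to later as FIELD1, that matches:
  [a-zA-Z].*[a-zA-Z] - a Latin letter, then any characters, as many as possible (using the greedy *, not the lazy *?), up to the last Latin letter
[^a-zA-Z]* - 0 or more characters other than Latin letters
$ - end of string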
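
To make the breakdown above concrete, here is a small Python sketch that applies the same pattern to a few sample lines. It is only an illustration, not Pig code. It assumes that REGEX_EXTRACT_ALL yields a result only when the pattern matches the entire line, which is my understanding of why '[A-Z]*' returns nothing, and it uses re.fullmatch to imitate that behavior. The helper names extract_text and naive_extract are made up for this example.

```python
import re

# Pattern from the answer: skip leading non-letters, capture from the first
# ASCII letter through the last ASCII letter, skip trailing non-letters.
PATTERN = re.compile(r'^[^a-zA-Z]*([a-zA-Z].*[a-zA-Z])[^a-zA-Z]*$')


def extract_text(line):
    """Return capture group 1, or None when the whole line does not match."""
    m = PATTERN.fullmatch(line)
    return m.group(1) if m else None


def naive_extract(line):
    """Whole-line match with the question's pattern '[A-Z]*' (assumed semantics)."""
    m = re.fullmatch(r'[A-Z]*', line)
    return m.group(0) if m else None


samples = [
    "\u2022 PRT_Edit & Set Shopping Cart in Retail",   # bullet glyph + space
    "  \tPRT_Confirm Shopping Cart for Goods \t",      # tabs and spaces
    "o PRT-Ret_Process Supplier Invoice",              # literal letter 'o' marker
]

for line in samples:
    # Print both results side by side for comparison.
    print(repr(extract_text(line)), repr(naive_extract(line)))
```

One caveat the sketch makes visible: if the "o" list marker is a literal letter o rather than a bullet glyph, the pattern treats it as the first letter and keeps it in FIELD1. In that case the leading part of the pattern would need to be adjusted for the actual marker characters in the data.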