Apache Pig - Substring in other string
I'm currently working with Pig and trying to check whether the value of one field (a chararray) occurs in another field (also a chararray).
Here is an example.
File t.txt:
1;This is a banana which is yellow.;Fruit;Banana
2;This is not about fruit but about Apple Inc.;Company;Apple
In the example above, I want to check whether the last field (i.e. Banana and Apple) occurs in the second field (the sentence). This is my Pig script so far:
a = LOAD 't.txt' using PigStorage(';') AS (id:chararray, sentence:chararray, kind:chararray, search:chararray);
b = FOREACH a GENERATE id, LOWER(sentence) as sent:chararray, kind, LOWER(search) as srch:chararray;
c = FILTER b BY sent MATCHES '.* srch .*';
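What I want to achieve is to get the bigrams surrounding the search term. As a concrete example, this is what I'm looking for (or something in another form):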
(1,Fruit,{(a, banana),(banana, which)})
(2,Company,{(about, apple),(apple, inc.)})
So my question is: how can I use the field search from the schema to match against the field sentence from the schema?
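A note on why the FILTER in the script comes back empty: inside the quoted pattern, srch is only the literal text "srch", not the value of the field. Pig's MATCHES takes a Java regular expression and has to match the whole string, so the effect can be reproduced in plain Java. The sketch below is just an illustration of that behaviour, not Pig code; the class name MatchesDemo and the helper containsWord are made up for this example.

```java
import java.util.regex.Pattern;

class MatchesDemo {
    // Hypothetical helper: builds the pattern from the *value* of the search
    // term instead of hard-coding the field name inside the regex.
    static boolean containsWord(String sentence, String search) {
        // Pattern.quote keeps characters such as '.' in the term from being
        // read as regex syntax; \b restricts the hit to whole words.
        String regex = ".*\\b" + Pattern.quote(search) + "\\b.*";
        return sentence.matches(regex);
    }

    public static void main(String[] args) {
        String sent = "this is a banana which is yellow.";
        // The quoted pattern from the question looks for the text " srch ".
        System.out.println(sent.matches(".* srch .*"));   // false
        // Using the field value finds the word.
        System.out.println(containsWord(sent, "banana")); // true
    }
}
```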
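Use a UDF. Pass the sentence and the search term to the UDF. In the UDF, split the sentence into words and iterate over the words. If a word matches, get the words before and after the search term.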
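The steps above can be tried out in plain Java before wiring them into Pig. This is a minimal sketch with no Pig dependency; the class name SurroundingBigrams and the method bigrams are assumptions made for the example. Unlike a UDF that returns at the first hit, it collects pairs for every occurrence and simply leaves out a pair when the term has no neighbour on that side.

```java
import java.util.ArrayList;
import java.util.List;

class SurroundingBigrams {
    // Hypothetical helper: returns "(previous,search)" and "(search,next)"
    // for each whitespace-delimited word that equals the search term.
    static List<String> bigrams(String sentence, String search) {
        List<String> pairs = new ArrayList<>();
        String[] words = sentence.trim().split("\\s+");
        for (int i = 0; i < words.length; i++) {
            if (!words[i].equals(search)) {
                continue;
            }
            if (i > 0) {
                pairs.add("(" + words[i - 1] + "," + search + ")");
            }
            if (i < words.length - 1) {
                pairs.add("(" + search + "," + words[i + 1] + ")");
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Inputs are the lower-cased sentence and search fields from t.txt.
        System.out.println(bigrams("this is a banana which is yellow.", "banana"));
        // prints [(a,banana), (banana,which)]
        System.out.println(bigrams("this is not about fruit but about apple inc.", "apple"));
        // prints [(about,apple), (apple,inc.)]
    }
}
```

The comparison is an exact word match, so a term directly followed by punctuation (for example "banana.") would not be found unless the punctuation is stripped first.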
PigScript
REGISTER GetSurroundingWords.jar;
DEFINE GetSurroundingWords com.mypackages.GetSurroundingWords();
A = LOAD 'test11.txt' using PigStorage(';') AS (id:chararray, sentence:chararray, kind:chararray, search:chararray);
B = FOREACH A GENERATE id, LOWER(sentence) as sent:chararray, kind, LOWER(search) as srch:chararray;
C = FOREACH B GENERATE id,kind,GetSurroundingWords(sent,srch);
DUMP C;
Output
Java UDF
package com.mypackages;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class GetSurroundingWords extends EvalFunc<String>
{
    @Override
    public String exec(Tuple input) throws IOException
    {
        // Expect two non-null fields: the sentence and the search term.
        if (input == null || input.size() < 2 || input.get(0) == null || input.get(1) == null)
        {
            return null;
        }
        // Read the fields directly from the tuple. Splitting input.toString()
        // on "," would cut off any sentence that itself contains a comma.
        String sSentence = input.get(0).toString().trim();
        String sSearchItem = input.get(1).toString().trim();
        String[] sWords = sSentence.split("\\s+");
        for (int iIndex = 0; iIndex < sWords.length; iIndex++)
        {
            if (sWords[iIndex].equals(sSearchItem))
            {
                // Pair the term with the previous word, or emit it alone
                // when it is the first word of the sentence.
                String sBefore = (iIndex > 0)
                    ? "(" + sWords[iIndex - 1] + "," + sSearchItem + ")"
                    : "(" + sSearchItem + ")";
                // Pair the term with the next word, or emit it alone
                // when it is the last word of the sentence.
                String sAfter = (iIndex < sWords.length - 1)
                    ? "(" + sSearchItem + "," + sWords[iIndex + 1] + ")"
                    : "(" + sSearchItem + ")";
                // Only the first occurrence is reported.
                return sBefore + "," + sAfter;
            }
        }
        // The search term does not occur in the sentence.
        return "";
    }
}