Apache Pig - Substring in other string

I'm currently working in Pig, and I'm trying to check whether the value of one field (a chararray) occurs in another field (also a chararray). Here is an example.

File t.txt:

1;This is a banana which is yellow.;Fruit;Banana
2;This is not about fruit but about Apple Inc.;Company;Apple

In the example above, I want to check whether the last field (i.e. Banana or Apple) appears in the second field (the sentence). So far, this is my Pig script:

a = LOAD 't.txt' using PigStorage(';') AS (id:chararray, sentence:chararray, kind:chararray, search:chararray);

b = FOREACH a GENERATE id, LOWER(sentence) as sent:chararray, kind, LOWER(search) as srch:chararray;

c = FILTER b BY sent MATCHES '.* srch .*';

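What I want to achieve is to get the bigrams surrounding the search term. To give a concrete example, this is what I'm looking for (or something of a similar form):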

(1,Fruit,{(a, banana),(banana, which)})
(2,Company,{(about, apple),(apple, inc.)})

So, my question is: how can I use the field search from the schema to match against the field sentence from the same schema?
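A note on why the FILTER above matches nothing: Pig's MATCHES uses Java regular expressions and has to match the whole string, and everything between the quotes is pattern text. So '.* srch .*' looks for the literal word "srch" rather than the value of the srch field. The plain-Java sketch below illustrates the difference. The class and method names are made up for illustration and are not part of Pig.

```java
import java.util.regex.Pattern;

public class LiteralVsFieldPattern {
    // Mirrors the FILTER in the question: "srch" is literal text inside the pattern.
    static boolean matchesLiteral(String sentence) {
        return sentence.matches(".* srch .*");
    }

    // Builds the pattern from the field's value instead.
    // Pattern.quote escapes any regex metacharacters in the search term.
    static boolean matchesValue(String sentence, String search) {
        return sentence.matches(".* " + Pattern.quote(search) + " .*");
    }

    public static void main(String[] args) {
        String sentence = "this is a banana which is yellow.";
        System.out.println(matchesLiteral(sentence));         // false
        System.out.println(matchesValue(sentence, "banana")); // true
    }
}
```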

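Use a UDF. Pass the sentence and the search item to the UDF. In the UDF, split the sentence into words and iterate over the words. If a word matches, get the words before and after the search item.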
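The steps above can be sketched in plain Java, independent of Pig, so the logic can be checked without a cluster. This is only an illustration under a few assumptions: the class and method names are hypothetical, words are separated by whitespace, and only the first exact match is reported.

```java
import java.util.ArrayList;
import java.util.List;

public class SurroundingWordsSketch {
    // Returns the (previous, term) and (term, next) pairs around the first
    // whitespace-delimited word equal to searchTerm; empty list if absent.
    static List<String> surroundingBigrams(String sentence, String searchTerm) {
        List<String> bigrams = new ArrayList<>();
        String[] words = sentence.trim().split("\\s+");
        for (int i = 0; i < words.length; i++) {
            if (!words[i].equals(searchTerm)) {
                continue;
            }
            if (i > 0) {
                bigrams.add("(" + words[i - 1] + "," + searchTerm + ")");
            }
            if (i < words.length - 1) {
                bigrams.add("(" + searchTerm + "," + words[i + 1] + ")");
            }
            break; // stop after the first occurrence
        }
        return bigrams;
    }

    public static void main(String[] args) {
        // prints [(a,banana), (banana,which)]
        System.out.println(surroundingBigrams("this is a banana which is yellow.", "banana"));
        // prints [(about,apple), (apple,inc.)]
        System.out.println(surroundingBigrams("this is not about fruit but about apple inc.", "apple"));
    }
}
```

Because the comparison is an exact word match, a term directly followed by punctuation (for example "banana.") would not be found. This sketch also omits a pair when the term is the first or last word, whereas the UDF emits the term on its own in that case.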

PigScript

REGISTER GetSurroundingWords.jar;
DEFINE GetSurroundingWords com.mypackages.GetSurroundingWords();

A = LOAD 'test11.txt' using PigStorage(';') AS (id:chararray, sentence:chararray, kind:chararray, search:chararray);
B = FOREACH A GENERATE id, LOWER(sentence) as sent:chararray, kind, LOWER(search) as srch:chararray;
C = FOREACH B GENERATE id,kind,GetSurroundingWords(sent,srch);
DUMP C;

Output

Java UDF

package com.mypackages;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class GetSurroundingWords extends EvalFunc<String>
{
    @Override
    public String exec(Tuple input) throws IOException
    {
        // Expect two non-null fields: the sentence and the search item.
        if (input == null || input.size() < 2 || input.get(0) == null || input.get(1) == null)
        {
            return null;
        }

        // Read the fields directly instead of splitting input.toString() on ",",
        // which breaks as soon as the sentence itself contains a comma.
        String sSentence = input.get(0).toString().trim();
        String sSearchItem = input.get(1).toString().trim();
        if (sSentence.isEmpty() || sSearchItem.isEmpty())
        {
            return null;
        }

        String[] sWords = sSentence.split(" ");
        for (int iIndex = 0; iIndex < sWords.length; iIndex++)
        {
            if (sWords[iIndex].equals(sSearchItem))
            {
                // Pair with the previous word, or emit the search item alone at the start of the sentence.
                String sBefore = (iIndex > 0)
                    ? "(" + sWords[iIndex - 1] + "," + sSearchItem + ")"
                    : "(" + sSearchItem + ")";
                // Pair with the next word, or emit the search item alone at the end of the sentence.
                String sAfter = (iIndex < sWords.length - 1)
                    ? "(" + sSearchItem + "," + sWords[iIndex + 1] + ")"
                    : "(" + sSearchItem + ")";
                // Only the first occurrence is reported.
                return sBefore + "," + sAfter;
            }
        }
        // No word matched the search item.
        return "";
    }
}