How to compare two pair RDDs
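I have two pair RDDs, r1 and r2, containing tuples defined as
Tuple2<Integer,String[]>
What I want to do is find the tuples that have the same key in both RDDs, then compare each element of the value part (the String[]) in r1 with the element at the same position in r2, and then return the indexes of the elements that differ. As an example, suppose r1 looks like this: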
{ (1,["a1","b1","c1"]) (2,["x1","y1","z1"])...}
and r2 looks like this:
{ (1,["a2","b2","c2"]) (3,["x2","y2","z2"])...}
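Here the key 1 exists in both RDDs, so it is the one of interest. Now I want to scan the value parts in both RDDs and compare the elements one by one, each with the element at the same index in the other RDD. When I find elements that differ (at the same index in r1's tuple and r2's tuple), I return that index. Let me explain.
This is the tuple that has the key 1 in r1: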
(1,["a1","b1","c1"])
And this is the tuple that has the key 1 in r2:
(1,["a2","b2","c2"])
While scanning, I compare "a1" with "a2", "b1" with "b2", and "c1" with "c2".
Suppose that after the comparison I find:
"a1".equals("a2") = true, "b1".equals("b2") = false, and "c1".equals("c2") = false
Given that array indexes in Java start at 0 and that, as I said before, I want to return the indexes of the elements that are not equal, in this example I would return index1=1 and index2=2. How can I do that?
Note: if I have to return more than one index, I think it would be better to collect them in one RDD of Integers named
JavaRDD<Integer> indexes
I hope that is clear, and I would appreciate any help. Thank you.
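You can use join followed by mapToPair. Calling map on a JavaPairRDD returns a plain JavaRDD of tuples, so mapToPair with a PairFunction is needed to get a JavaPairRDD back.
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

// join keeps only the keys present in both r1 and r2
JavaPairRDD<Integer, Integer[]> idWithIndexes = r1.join(r2).mapToPair(
    new PairFunction<Tuple2<Integer, Tuple2<String[], String[]>>, Integer, Integer[]>() {
  @Override
  public Tuple2<Integer, Integer[]> call(Tuple2<Integer, Tuple2<String[], String[]>> t) throws Exception {
    int id = t._1;
    String[] s1 = t._2._1;
    String[] s2 = t._2._2;
    // compare only the positions that exist in both arrays
    int length = Math.min(s1.length, s2.length);
    List<Integer> index = new ArrayList<Integer>();
    for (int i = 0; i < length; i++) {
      if (!s1[i].equals(s2[i])) {
        index.add(i); // elements at position i differ
      }
    }
    return new Tuple2<Integer, Integer[]>(id, index.toArray(new Integer[0]));
  }
});
This returns a JavaPairRDD of each id and the array of indexes at which the two values differ.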
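The per-key comparison in the answer can also be checked on its own, without a Spark context. The following is only a minimal sketch under some assumptions: the class name DiffIndexes and the method diffIndexes are hypothetical names introduced for illustration, the sample arrays are made up so that the first elements really are equal, and Objects.equals is swapped in for String.equals so that null elements do not throw.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical helper class, not part of Spark or of the original answer.
class DiffIndexes {

    // Returns the indexes at which s1 and s2 hold different elements.
    // Only positions present in both arrays are compared.
    static List<Integer> diffIndexes(String[] s1, String[] s2) {
        int length = Math.min(s1.length, s2.length);
        List<Integer> indexes = new ArrayList<>();
        for (int i = 0; i < length; i++) {
            // Objects.equals is null-safe, unlike s1[i].equals(s2[i])
            if (!Objects.equals(s1[i], s2[i])) {
                indexes.add(i);
            }
        }
        return indexes;
    }

    public static void main(String[] args) {
        // Made-up values: index 0 matches, indexes 1 and 2 differ.
        String[] fromR1 = {"a", "b1", "c1"};
        String[] fromR2 = {"a", "b2", "c2"};
        System.out.println(diffIndexes(fromR1, fromR2)); // prints [1, 2]
    }
}
```

A method like this could be called from inside the function passed to the RDD transformation, which keeps the comparison logic testable separately from the Spark job.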