Spark 1.6.0 DataFrame self-join issue
I am trying to perform a self-join using the DataFrame Scala API.
Here is my code snippet.
Could you tell me what is wrong with the first approach?
val df= sqlc.read.json("empMgr.json");
empMgr.json
{"ID":101,"ename":"Peter","sal":24.24,"dept":"11","country": "US","doj":"1/12/2017","mgr":201}
{"ID":201,"ename":"John","sal":1300,"dept":"232","country":"IN" ,"doj":"2016/4/22","mgr":111}
{"ID":301,"ename":"Sam","dept":"22","country":"KR","doj":" 5/22/2015","mgr":201}
// 1. The following does not work: the result is empty
var df_right = df;
df.join(df_right, df("mgr") === df_right("ID")).show()
df.join(df, df("mgr") === df("ID")).show()
/*
 * Output:
 * +---+-------+----+---+-----+---+---+---+-------+----+---+-----+---+---+
 * | ID|country|dept|doj|ename|mgr|sal| ID|country|dept|doj|ename|mgr|sal|
 * +---+-------+----+---+-----+---+---+---+-------+----+---+-----+---+---+
 * +---+-------+----+---+-----+---+---+---+-------+----+---+-----+---+---+
 */
// 2. The following works fine
df_right = sqlc.read.json("file:///opt/data/empMgr.json");
df.join(df_right, df("mgr") === df_right("ID")).show()
/*
 * Output:
 * +---+-------+----+---------+-----+---+-----+---+-------+----+---------+-----+---+------+
 * | ID|country|dept|      doj|ename|mgr|  sal| ID|country|dept|      doj|ename|mgr|   sal|
 * +---+-------+----+---------+-----+---+-----+---+-------+----+---------+-----+---+------+
 * |101|     US|  11|1/12/2017|Peter|201|24.24|201|     IN| 232|4/22/2016| John|111|1300.0|
 * |301|     KR|  22|5/22/2015|  Sam|201| null|201|     IN| 232|4/22/2016| John|111|1300.0|
 * +---+-------+----+---------+-----+---+-----+---+-------+----+---------+-----+---+------+
 */
// 3. The following works fine
df.registerTempTable("empMgr")
sqlc.sql("select b.ename, a.ename as mgr, b.mgr from empMgr a join empMgr b on a.ID = b.mgr").show();
/*
 * Output:
 * +-----+----+---+
 * |ename| mgr|mgr|
 * +-----+----+---+
 * |Peter|John|201|
 * |  Sam|John|201|
 * +-----+----+---+
 */
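Use the DataFrame's as() method to give each side of the join an alias. This removes the ambiguity when referring to columns that share the same name.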
df.as("a").join(df.as("b"), $"a.mgr" === $"b.ID").show
+---+-------+----+---------+-----+---+-----+---+-------+----+---------+-----+---+------+
| ID|country|dept|      doj|ename|mgr|  sal| ID|country|dept|      doj|ename|mgr|   sal|
+---+-------+----+---------+-----+---+-----+---+-------+----+---------+-----+---+------+
|101|     US|  11|1/12/2017|Peter|201|24.24|201|     IN| 232|4/22/2016| John|111|1300.0|
|301|     KR|  22|5/22/2015|  Sam|201| null|201|     IN| 232|4/22/2016| John|111|1300.0|
+---+-------+----+---------+-----+---+-----+---+-------+----+---------+-----+---+------+
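As for why the first attempt returns nothing: my understanding (not confirmed against the Spark source here) is that df and df_right are the same DataFrame, so df("mgr") and df_right("ID") both resolve to columns on the same side of the join. The condition then compares mgr with ID within a single row, which never matches in this data. The sketch below applies the alias approach and also selects only the columns the SQL version returns. The object name SelfJoinSketch, the helper withManager, the aliases emp and boss, and the output column name manager are all made up for illustration, and the sample rows are a trimmed copy of empMgr.json built in memory. It uses col("emp.mgr") from org.apache.spark.sql.functions, which refers to the same aliased column as $"emp.mgr" without needing the implicits import.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.col

// Hypothetical helper object, written against the Spark 1.6 DataFrame API.
object SelfJoinSketch {

  // Trimmed copy of empMgr.json: only the columns the join needs.
  val sampleRows: Seq[String] = Seq(
    """{"ID":101,"ename":"Peter","mgr":201}""",
    """{"ID":201,"ename":"John","mgr":111}""",
    """{"ID":301,"ename":"Sam","mgr":201}"""
  )

  // Self-join: alias each side so that "emp.mgr" and "boss.ID"
  // refer to different sides of the join, then pick named columns.
  def withManager(df: DataFrame): DataFrame =
    df.as("emp")
      .join(df.as("boss"), col("emp.mgr") === col("boss.ID"))
      .select(
        col("emp.ename"),
        col("boss.ename").as("manager"), // renamed to avoid two columns called mgr
        col("emp.mgr")
      )

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("self-join-sketch").setMaster("local[1]")
    val sc = new SparkContext(conf)
    val sqlc = new SQLContext(sc)

    // Build the DataFrame from in-memory JSON strings instead of a file.
    val df = sqlc.read.json(sc.parallelize(sampleRows))

    withManager(df).show()
    sc.stop()
  }
}
```

Unlike the SQL in approach 3, which ends up with two columns both named mgr, this names the manager's ename column manager so that later references to the result stay unambiguous.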