How to add a sort condition to a Spark DataFrame
I have a DataFrame like this:
+-------------------+----+
|DATE |CODE|
+-------------------+----+
|2015/02/30-14:32:32|xv |
|2016/02/30-12:50:11|val2|
|2016/02/30-12:50:11|val2|
|2016/02/30-12:50:11|val2|
|2015/01/30-10:45:16|val2|
|2016/02/30-07:45:26|cv |
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val3|
|2016/02/30-12:50:11|val3|
|2015/01/30-10:45:16|val1|
|2015/11/30-04:45:19|sd |
|2015/05/23-10:32:16|val2|
|2016/09/30-14:45:58|cv |
|2015/08/30-15:45:00|rt |
|2016/01/30-10:35:31|cv |
|2016/06/30-20:35:30|xv |
|2015/05/23-10:32:16|val1|
|2016/07/19-22:05:48|rt |
+-------------------+----+
I use this code to sort my sample by date:
// assumes a Spark shell, where the SQLContext implicits needed for toDF are already in scope
val df = sc.parallelize(Seq(
("2015/02/30-14:32:32", "xv"),
("2016/02/30-12:50:11", "val2"),
("2016/02/30-12:50:11", "val2"),
("2016/02/30-12:50:11", "val2"),
("2015/01/30-10:45:16", "val2"),
("2016/02/30-07:45:26", "cv"),
("2016/02/30-12:50:11", "val1"),
("2016/02/30-12:50:11", "val1"),
("2016/02/30-12:50:11", "val1"),
("2016/02/30-12:50:11", "val3"),
("2015/01/30-10:45:16", "val3"),
("2015/11/30-04:45:19", "sd"),
("2015/05/23-10:32:16", "val2"),
("2016/09/30-14:45:58", "cv"),
("2015/08/30-15:45:00", "rt"),
("2016/01/30-10:35:31", "cv"),
("2016/06/30-20:35:30", "xv"),
("2015/05/23-10:32:16", "val1"),
("2016/07/19-22:05:48", "rt")
)).toDF("DATE", "CODE")
val df_sorted = df.sort("DATE")
df_sorted.show(false)
I get this result:
+-------------------+----+
|DATE |CODE|
+-------------------+----+
|2015/01/30-10:45:16|val3|
|2015/01/30-10:45:16|val2|
|2015/02/30-14:32:32|xv |
|2015/05/23-10:32:16|val2|
|2015/05/23-10:32:16|val1|
|2015/08/30-15:45:00|rt |
|2015/11/30-04:45:19|sd |
|2016/01/30-10:35:31|cv |
|2016/02/30-07:45:26|cv |
|2016/02/30-12:50:11|val3|
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val2|
|2016/02/30-12:50:11|val2|
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val2|
|2016/02/30-12:50:11|val1|
|2016/06/30-20:35:30|xv |
|2016/07/19-22:05:48|rt |
|2016/09/30-14:45:58|cv |
+-------------------+----+
I want to add a sort condition: when codes starting with val share the same date (YYYY/MM/DD-hh:mm:ss), I want them ordered val2, val1, val3, so that I get this result:
+-------------------+----+
|DATE |CODE|
+-------------------+----+
|2015/01/30-10:45:16|val2|
|2015/01/30-10:45:16|val1|
|2015/02/30-14:32:32|xv |
|2015/05/23-10:32:16|val2|
|2015/05/23-10:32:16|val1|
|2015/08/30-15:45:00|rt |
|2015/11/30-04:45:19|sd |
|2016/01/30-10:35:31|cv |
|2016/02/30-07:45:26|cv |
|2016/02/30-12:50:11|val2|
|2016/02/30-12:50:11|val2|
|2016/02/30-12:50:11|val2|
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val3|
|2016/02/30-12:50:11|val3|
|2016/06/30-20:35:30|xv |
|2016/07/19-22:05:48|rt |
|2016/09/30-14:45:58|cv |
+-------------------+----+
Do you have any ideas?
You can sort by multiple columns:
val df_sorted2 = df.sort("DATE","CODE")
df_sorted2.show()
This gives me:
+-------------------+----+
| DATE|CODE|
+-------------------+----+
|2015/01/30-10:45:16|val1|
|2015/01/30-10:45:16|val2|
|2015/02/30-14:32:32| xv|
|2015/05/23-10:32:16|val1|
|2015/05/23-10:32:16|val2|
|2015/08/30-15:45:00| rt|
|2015/11/30-04:45:19| sd|
|2016/01/30-10:35:31| cv|
|2016/02/30-07:45:26| cv|
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val2|
|2016/02/30-12:50:11|val2|
|2016/02/30-12:50:11|val2|
|2016/06/30-20:35:30| xv|
|2016/07/19-22:05:48| rt|
|2016/09/30-14:45:58| cv|
+-------------------+----+
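Note that a plain ascending sort on CODE puts val1 before val2, not the val2, val1, val3 order asked for in the question. One way to get that exact order (not taken from the answers above; a minimal sketch assuming the when/otherwise helpers from org.apache.spark.sql.functions) is to derive a numeric rank for the val* codes and use it as a tie-breaker:

import org.apache.spark.sql.functions.{col, when}

// Rank the val* codes in the requested order; any other code falls after them
// (an assumption -- the question only constrains rows whose CODE starts with "val")
val codeRank = when(col("CODE") === "val2", 0)
  .when(col("CODE") === "val1", 1)
  .when(col("CODE") === "val3", 2)
  .otherwise(3)

val df_custom = df.orderBy(col("DATE").asc, codeRank.asc, col("CODE").asc)
df_custom.show(false)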
This assumes sc is a hiveContext; if it is not, wrap the sparkContext in a Hive context first.
df.registerTempTable("MY_TEMP_TABLE");
val sortedDF = sc.sql("SELECT * FROM MY_TEMP_TABLE ORDER BY DATE ASC, CODE DESC");
sortedDF.show
or whatever SQL ordering you want to run.
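For the "wrap the sparkContext in a Hive context" step, a minimal sketch under the Spark 1.x API implied by registerTempTable might look like this (in Spark 2.x a SparkSession and spark.sql would play the same role):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // sc here is the plain SparkContext
import hiveContext.implicits._          // brings toDF into scope for the Seq above

df.registerTempTable("MY_TEMP_TABLE")
val sortedDF = hiveContext.sql("SELECT * FROM MY_TEMP_TABLE ORDER BY DATE ASC, CODE DESC")
sortedDF.show()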