Incremental addition with condition in dataframe
I have a DataFrame like this:
finalSondDF.show()
+---------------+------------+----------------+
|webService_Name|responseTime|numberOfSameTime|
+---------------+------------+----------------+
| webservice1| 80| 1|
| webservice1| 87| 2|
| webservice1| 283| 1|
| webservice2| 77| 2|
| webservice2| 80| 1|
| webservice2| 81| 1|
| webservice3| 63| 3|
| webservice3| 145| 1|
| webservice4| 167| 1|
| webservice4| 367| 2|
| webservice4| 500| 1|
+---------------+------------+----------------+
I want to get a result like this:
+---------------+------------+----------------+------+
|webService_Name|responseTime|numberOfSameTime|Result|
+---------------+------------+----------------+------+
| webservice1| 80| 1| 1|
| webservice1| 87| 2| 3| ==> 2+1
| webservice1| 283| 1| 4| ==> 1+2+1
| webservice2| 77| 2| 2|
| webservice2| 80| 1| 3| ==> 2+1
| webservice2| 81| 1| 4| ==> 2+1+1
| webservice3| 63| 3| 3|
| webservice3| 145| 1| 4| ==> 3+1
| webservice4| 167| 1| 1|
| webservice4| 367| 2| 3| ==> 1+2
| webservice4| 500| 1| 4| ==> 1+2+1
+---------------+------------+----------------+------+
Here, Result is the running sum of numberOfSameTime over all rows whose responseTime is less than or equal to the current row's responseTime, within the same webService_Name.
I can't figure out the logic for this. Can anyone help me?
If your data is arranged in increasing order of the responseTime column within each group of the webService_Name column, then you can benefit from a cumulative sum using a Window function as follows:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

// running total within each webService_Name, ordered by responseTime
// (df here is the finalSondDF from the question)
def windowSpec = Window.partitionBy("webService_Name").orderBy("responseTime")

df.withColumn("Result", sum("numberOfSameTime").over(windowSpec)).show(false)
You should get:
+---------------+------------+----------------+------+
|webService_Name|responseTime|numberOfSameTime|Result|
+---------------+------------+----------------+------+
|webservice1 |80 |1 |1 |
|webservice1 |87 |2 |3 |
|webservice1 |283 |1 |4 |
|webservice2 |80 |1 |3 |
|webservice2 |81 |1 |4 |
|webservice2 |77 |2 |2 |
|webservice3 |145 |1 |4 |
|webservice3 |63 |3 |3 |
|webservice4 |167 |1 |1 |
|webservice4 |367 |2 |3 |
|webservice4 |500 |1 |4 |
+---------------+------------+----------------+------+
Note that responseTime being of a numeric type and in increasing order within each webService_Name is what makes the above work.
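One subtlety worth knowing: with an orderBy and no explicit frame, Spark uses a RANGE frame, so rows that tie on responseTime all receive the same running total; an explicit ROWS frame accumulates them one row at a time. A minimal sketch illustrating the difference, using a small hypothetical dataset with a tie:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// hypothetical data with two rows tied at responseTime = 80
val ties = Seq(("ws1", 80, 1), ("ws1", 80, 2), ("ws1", 90, 1))
  .toDF("webService_Name", "responseTime", "numberOfSameTime")

// default frame: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
val byRange = Window.partitionBy("webService_Name").orderBy("responseTime")
// explicit ROWS frame: one row at a time
val byRows = Window.partitionBy("webService_Name").orderBy("responseTime")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

ties.withColumn("rangeSum", sum("numberOfSameTime").over(byRange))
    .withColumn("rowsSum", sum("numberOfSameTime").over(byRows))
    .show(false)
// rangeSum gives 3, 3, 4 (tied rows share a total);
// rowsSum increases row by row (the order among tied rows is arbitrary)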
You can use the Window functions available in Spark and calculate the cumulative sum as shown below.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// dummy data
val d1 = spark.sparkContext.parallelize(Seq(
  ("webservice1", 80, 1),
  ("webservice1", 87, 2),
  ("webservice1", 283, 1),
  ("webservice2", 77, 2),
  ("webservice2", 80, 1),
  ("webservice2", 81, 1),
  ("webservice3", 63, 3),
  ("webservice3", 145, 1),
  ("webservice4", 167, 1),
  ("webservice4", 367, 2),
  ("webservice4", 500, 1)
)).toDF("webService_Name", "responseTime", "numberOfSameTime")

// window function: partition by service, order by responseTime,
// frame spanning from the first row of the partition to the current row
val window = Window.partitionBy("webService_Name").orderBy($"responseTime")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

// create new column for Result
d1.withColumn("Result", sum("numberOfSameTime").over(window)).show(false)
Output:
+---------------+------------+----------------+------+
|webService_Name|responseTime|numberOfSameTime|Result|
+---------------+------------+----------------+------+
|webservice4 |167 |1 |1 |
|webservice4 |367 |2 |3 |
|webservice4 |500 |1 |4 |
|webservice2 |77 |2 |2 |
|webservice2 |80 |1 |3 |
|webservice2 |81 |1 |4 |
|webservice3 |63 |3 |3 |
|webservice3 |145 |1 |4 |
|webservice1 |80 |1 |1 |
|webservice1 |87 |2 |3 |
|webservice1 |283 |1 |4 |
+---------------+------------+----------------+------+
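For reference, the same cumulative sum can be expressed in Spark SQL; a minimal sketch, assuming the DataFrame is registered under the hypothetical view name services:

d1.createOrReplaceTempView("services")
spark.sql("""
  SELECT webService_Name, responseTime, numberOfSameTime,
         SUM(numberOfSameTime) OVER (
           PARTITION BY webService_Name
           ORDER BY responseTime
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
         ) AS Result
  FROM services
""").show(false)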
Hope this helps!