How to partition my field value in a DataFrame in Scala
I have a DataFrame with the following schema:
root
|-- school: string (nullable = true)
|-- questionName: string (nullable = true)
|-- difficultyValue: double (nullable = true)
The data looks like this:
school | questionName | difficultyValue
school1 | q1 | 0.32
school1 | q2 | 0.13
school1 | q3 | 0.58
school1 | q4 | 0.67
school1 | q5 | 0.59
school1 | q6 | 0.43
school1 | q7 | 0.31
school1 | q8 | 0.15
school1 | q9 | 0.21
school1 | q10 | 0.92
But now I want to bucket the field "difficultyValue" by its value and transform this DataFrame into a new one with the following schema:
root
|-- school: string (nullable = true)
|-- difficulty1: double (nullable = true)
|-- difficulty2: double (nullable = true)
|-- difficulty3: double (nullable = true)
|-- difficulty4: double (nullable = true)
|-- difficulty5: double (nullable = true)
The new table would look like this:
school | difficulty1 | difficulty2 | difficulty3 | difficulty4 | difficulty5
school1 | 2 | 3 | 3 | 1 | 1
The value of "difficulty1" is the count of rows with "difficultyValue" < 0.2;
the value of "difficulty2" is the count of rows with 0.2 <= "difficultyValue" < 0.4;
the value of "difficulty3" is the count of rows with 0.4 <= "difficultyValue" < 0.6;
the value of "difficulty4" is the count of rows with 0.6 <= "difficultyValue" < 0.8;
the value of "difficulty5" is the count of rows with 0.8 <= "difficultyValue" < 1.0.
For example, q1 with difficultyValue 0.32 falls in [0.2, 0.4) and therefore counts toward "difficulty2".
I don't know how to do this transformation. How can I achieve it?
The following function:
def valueToIndex(v: Double): Int = scala.math.ceil(v*5).toInt
will determine the index you want from a difficulty value, since you only need 5 uniform bins. You can use this function with withColumn and a udf to create a new derived column, and then use pivot to generate the count of rows per index. One boundary caveat: for a value sitting exactly on a bin edge (e.g. 0.2), ceil places it in the lower bin, whereas your buckets are closed on the left; the floor-based expression floor(v*5) + 1 used in the code below matches your definitions exactly.
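For completeness, here is a minimal sketch of the UDF variant described above, assuming df is the source DataFrame from the question and a SparkSession is already in scope; the names valueToIndexUdf, dfWithCat, and the column name "difficultyCat" are our own choices:

import org.apache.spark.sql.functions.{col, udf}

// Wrap the bucketing function as a UDF and derive the category column.
val valueToIndexUdf = udf((v: Double) => scala.math.ceil(v * 5).toInt)
val dfWithCat = df.withColumn("difficultyCat", valueToIndexUdf(col("difficultyValue")))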
// First create a test DataFrame with the schema of your given source.
import org.apache.spark.sql._
import org.apache.spark.sql.functions._   // needed for col and floor below
import org.apache.spark.sql.types._
import scala.collection.JavaConverters._

val df = {
  val simpleSchema = StructType(
    StructField("school", StringType, false) ::
    StructField("questionName", StringType, false) ::
    StructField("difficultyValue", DoubleType) :: Nil)
  val data = List(
    Row("school1", "q1", 0.32),
    Row("school1", "q2", 0.45),
    Row("school1", "q3", 0.22),
    Row("school1", "q4", 0.12),
    Row("school2", "q1", 0.32),
    Row("school2", "q2", 0.42),
    Row("school2", "q3", 0.52),
    Row("school2", "q4", 0.62)
  )
  spark.createDataFrame(data.asJava, simpleSchema)
}

// Add a new column holding the 1-5 difficulty category.
val df2 = df.withColumn("difficultyCat", floor(col("difficultyValue").multiply(5.0)) + 1)

// groupBy and pivot to get the final view that you want.
// Here we list the values 1-5 up front; if you don't know them beforehand,
// you can omit the value list at a performance cost (Spark must scan the
// data once to discover the distinct pivot values).
val df3 = df2.groupBy("school").pivot("difficultyCat", Seq(1, 2, 3, 4, 5)).count()
df3.show()
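To line the result up with the schema you asked for, you can rename the pivoted columns "1".."5" to difficulty1..difficulty5 and replace the nulls that pivot produces for empty bins with 0. A short sketch under those assumptions (df4 is our own name):

// Rename the pivot columns and fill empty bins (null counts) with 0.
val df4 = (1 to 5).foldLeft(df3) { (d, i) =>
  d.withColumnRenamed(i.toString, s"difficulty$i")
}.na.fill(0)
df4.show()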