How to display grouped data in Scala Dataframe
I'm a beginner in Scala, and I have a dataframe that looks like this (abbreviated):
root
|-- contigName: string (nullable = true)
|-- start: long (nullable = true)
|-- end: long (nullable = true)
|-- names: array (nullable = true)
| |-- element: string (containsNull = true)
|-- referenceAllele: string (nullable = true)
|-- alternateAlleles: array (nullable = true)
| |-- element: string (containsNull = true)
I'm simply trying to groupBy the names column:
display(dataframe.groupBy("names"))
A very simple operation, but I get:
notebook:1: error: overloaded method value display with alternatives:
[A](data: Seq[A])(implicit evidence: reflect.runtime.universe.TypeTag[A])Unit <and>
(dataset: org.apache.spark.sql.Dataset[_],streamName: String,trigger: org.apache.spark.sql.streaming.Trigger,checkpointLocation: String)Unit <and>
(model: org.apache.spark.ml.classification.DecisionTreeClassificationModel)Unit <and>
(model: org.apache.spark.ml.regression.DecisionTreeRegressionModel)Unit <and>
(model: org.apache.spark.ml.clustering.KMeansModel)Unit <and>
(model: org.apache.spark.mllib.clustering.KMeansModel)Unit <and>
(documentable: com.databricks.dbutils_v1.WithHelpMethods)Unit
cannot be applied to (org.apache.spark.sql.RelationalGroupedDataset)
display(dataframe.groupBy("names"))
How can I display this grouped data?
The solutions I've seen are all quite complex, and I don't think this is a duplicate; what I want is very simple.
groupBy returns a RelationalGroupedDataset. You need to add an aggregate function (e.g. count()):
dataframe.groupBy("names").count()
or dataframe.groupBy("names").agg(max("end"))
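For reference, a minimal self-contained version of the above (a sketch, assuming a Databricks notebook where display is available; max comes from org.apache.spark.sql.functions):

import org.apache.spark.sql.functions.max

// Aggregate first, then pass the resulting Dataset to display
display(dataframe.groupBy("names").count())
display(dataframe.groupBy("names").agg(max("end")))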
If you need to group by each individual name, you can explode the "names" array before the groupBy:
import org.apache.spark.sql.functions.{col, explode}

dataframe
  .withColumn("name", explode(col("names")))  // one row per element of the names array
  .drop("names")
  .groupBy("name")
  .count() // or other aggregate functions inside agg()
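To make the explode step concrete, here is a self-contained sketch with made-up sample rows (hypothetical values; the column names mirror the schema above, and master("local[*]") is just for local testing):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder.appName("groupBy-demo").master("local[*]").getOrCreate()
import spark.implicits._

val sample = Seq(
  ("chr1", 100L, 200L, Seq("rs1", "rs2")),
  ("chr1", 300L, 400L, Seq("rs2"))
).toDF("contigName", "start", "end", "names")

sample
  .withColumn("name", explode(col("names")))  // one row per array element
  .drop("names")
  .groupBy("name")
  .count()
  .orderBy("name")
  .show()
// +----+-----+
// |name|count|
// +----+-----+
// | rs1|    1|
// | rs2|    2|
// +----+-----+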