Scala 中的嵌套分组依据和聚合
Nested GroupBy & Aggregation in Scala
我正在尝试 groupBy
基于 ResourceId
& Category
和 return 相应的可用的最高严重性级别。
严重性等级为严重 > 主要 > 次要。即按 ResourceId
和 Category
分组后,我们需要 return 该组的最高严重性。
case class Issue(
resourceId: String,
Category: String,
Severity: String,
incidentType: String
)
case class IssueStatus(
resourceId:String,
Hardware: Option[String],
Network: Option[String],
Software: Option[String]
)
List(
Issue("r1", "Network", "Critical", "incident1"),
Issue("r1", "Network", "Major", "incident2"),
Issue("r1", "Hardware", "Minor", "incident 3"),
Issue("r2", "Hardware", "Major", "incident 3"),
Issue("r3", "Software", "Minor", "incident 1"),
)
预期输出:
List(
IssueStatus("r1", Some("Minor"), Some("Critical"), None),
IssueStatus("r2", Some("Major"), None, None),
IssueStatus("r3", None, None, Some("Minor"))
)
更新:
类别映射到案例对象。即我们只有 3 个类别:网络、硬件和软件。
对于每个资源,我想知道每个类别中的最高严重性是什么。如果网络类别的严重性最高为严重,并且没有用于资源 r5
的软件和硬件类别的条目,则相应的 IssueStatus
将类似于
IssueStatus("r5", None, Some("Critical"), None)
case class Issue(
resourceId: String,
Category: String,
Severity: String,
incidentType: String
)
case class IssueStatus(
resourceId: String,
Hardware: Option[String],
Network: Option[String],
Software: Option[String]
)
val p = List(
Issue("r1", "Network", "Critical", "incident1"),
Issue("r1", "Hardware", "Minor", "incident 3"),
Issue("r2", "Hardware", "Major", "incident 3"),
Issue("r3", "Software", "Minor", "incident 1")
)
def getIssues(lstOfIssue: List[Issue], typeOfIssue: String): Option[String] = {
lstOfIssue.find(_.Category == typeOfIssue) match {
case Some(v) => Some(v.Severity)
case _ => None
}
}
def computeIssueStatus(listOfIssues: List[Issue]): List[IssueStatus] = {
listOfIssues.groupBy(issue => issue.resourceId)
.map(kv =>
IssueStatus(kv._1, getIssues(kv._2, "Hardware"), getIssues(kv._2, "Network"), getIssues(kv._2, "Software")))
.toList
}
computeIssueStatus(p)
这是我对 "issue" 的看法。
val input = List(
Issue("r1", "Network", "Critical", "incident1"),
Issue("r1", "Network", "Major", "incident2"),
Issue("r1", "Hardware", "Minor", "incident 3"),
Issue("r2", "Hardware", "Major", "incident 3"),
Issue("r3", "Software", "Minor", "incident 1"),
Issue("r3", "Software", "Critical", "incident 1"), // added 2 more for testing
Issue("r3", "Software", "Major", "incident 1"),
)
val res = input.groupBy(_.resourceId)
.mapValues(_.groupBy(_.Category)
.mapValues(_.map(_.Severity).min))
.map{ case (k,m) =>
IssueStatus(k, m.get("Hardware"), m.get("Network"), m.get("Software"))
}.toList
//res: List[IssueStatus] = List(IssueStatus(r3,None,None,Some(Critical))
// , IssueStatus(r2,Some(Major),None,None)
// , IssueStatus(r1,Some(Minor),Some(Critical),None))
注意:有一个不幸的小技巧,它依赖于 "Critical"、"Major" 和 "Minor" 的字母顺序,较早的优先于后者。如果 Severity
字符串是 "Bad"、"Very Bad" 和 "Doomed".
,这将不起作用
我相信这符合您的要求:
def highestIssueStatus(issues: List[Issue]): IssueStatus = {
def issueRank(issue: Issue): Int =
List("Minor", "Major", "Critical").indexOf(issue.Severity)
val high = issues
.groupBy(_.Category)
.mapValues(_.maxBy(issueRank).Severity)
IssueStatus(
issues.head.resourceId,
high.get("Hardware"),
high.get("Network"),
high.get("Software")
)
}
list.groupBy(_.resourceId).values.map(highestIssueStatus)
更新
感谢 Yaneeve 指出原文中的错误(issueRank
看起来是 _.Category
而不是 _.Severity
)
优化
根据 OP 的评论,这里有一个针对此问题的更优化但功能更少的解决方案。它一次性将答案构建到可变映射中,而不是使用 groupBy
然后处理结果。
val categories = Vector("Hardware", "Network", "Software")
val severities = Vector("Minor", "Major", "Critical")
val results = Vector(None) ++ severities.map(Some(_))
def parseIssues(issues: List[Issue]) = {
val issueMap = mutable.Map.empty[String, ArrayBuffer[Int]]
issues.foreach{ issue =>
val cat = categories.indexOf(issue.Category) + 1
val sev = severities.indexOf(issue.Severity) + 1
val cur = issueMap.get(issue.resourceId) match {
case Some(v) => v
case None =>
val n = ArrayBuffer(0, 0, 0, 0)
issueMap(issue.resourceId) = n
n
}
if (cur(cat) < sev) {
cur(cat) = sev
}
}
issueMap.map{ case (k, v) =>
IssueStatus(k, results(v(1)), results(v(2)), results(v(3)))
}
}
另一个优化是使用标量值而不是 String
作为类别和严重性。这将避免在主循环中调用 indexOf
并允许 mutable.Map
直接存储 Option[Severity]
而不是作为 results
.
的索引
这种方法也可以用于流式传输模式,其中状态更新在进入时不断添加到 Map
,并且可以随时提取最新状态。映射值是可变的,因此当问题被清除时,资源的状态可以重置为 0
(None
)。此处需要考虑线程安全问题,因此可以将其放置在 Akka Actor
.
中
我即将完成最后一步。仍在努力根据 resourceId 合并 IssueStatus。检查这个。
scala> case class Issue(
| resourceId: String,
| Category: String,
| Severity: String,
| incidentType: String
| )
defined class Issue
scala> case class IssueStatus(
| resourceId:String,
| Hardware: Option[String],
| Network: Option[String],
| Software: Option[String]
| )
defined class IssueStatus
scala>
scala> val issueList = List(
| Issue("r1", "Network", "Critical", "incident1"),
| Issue("r1", "Network", "Major", "incident2"),
| Issue("r1", "Hardware", "Minor", "incident 3"),
| Issue("r2", "Hardware", "Major", "incident 3"),
| Issue("r3", "Software", "Minor", "incident 1")
| )
issueList: List[Issue] = List(Issue(r1,Network,Critical,incident1), Issue(r1,Network,Major,incident2), Issue(r1,Hardware,Minor,incident 3), Issue(r2,Hardware,Major,incident 3), Issue(r3,Software,Minor,incident 1))
scala> val proc1 = issueList.groupBy( x=> (x.resourceId,x.Category)).map( x=>(x._1,(x._2).sortWith( (p,q) => p.Category > q.Category)(0))).map( x=> (x._1._1,x._1._2,x._2.Severity))
proc1: scala.collection.immutable.Iterable[(String, String, String)] = List((r1,Hardware,Minor), (r3,Software,Minor), (r2,Hardware,Major), (r1,Network,Critical))
scala> val proc2 = proc1.map( x => x match { case(a,"Hardware",c) => IssueStatus(a,Some(c),None,None) case(a,"Network",c) => IssueStatus(a,None,Some(c),None) case(a,"Software",c) => IssueStatus(a,None,None,Some(c)) } )
proc2: scala.collection.immutable.Iterable[IssueStatus] = List(IssueStatus(r1,Some(Minor),None,None), IssueStatus(r3,None,None,Some(Minor)), IssueStatus(r2,Some(Major),None,None), IssueStatus(r1,None,Some(Critical),None))
scala>
scala> proc2.foreach(println)
IssueStatus(r1,Some(Minor),None,None)
IssueStatus(r3,None,None,Some(Minor))
IssueStatus(r2,Some(Major),None,None)
IssueStatus(r1,None,Some(Critical),None)
scala>
再考虑一下解决方案:)
val input = List(
Issue("r1", "Network", "Critical", "incident1"),
Issue("r1", "Network", "Major", "incident2"),
Issue("r1", "Hardware", "Major", "incident5"),
Issue("r1", "Hardware", "Minor", "incident 3"),
Issue("r2", "Hardware", "Major", "incident 6"),
Issue("r2", "Hardware", "Critical", "incident 13"),
Issue("r3", "Software", "Minor", "incident 1"),
Issue("r3", "Network", "Major", "incident 1"),
)
val ranked = input.groupBy(_.resourceId).flatMap {case (resourceId, issuesByResource) =>
issuesByResource.groupBy(_.Category). map { case (category, issuesByCategoryPerResource) =>
implicit val _ : Ordering[Issue] = (lhs: Issue, rhs: Issue) => {
(lhs.Severity, rhs.Severity) match {
case ("Critical", _) => -1
case (_, "Critical") => 1
case ("Major", _) => -1
case (_, "Major") => 1
case _ => -1
}
}
(resourceId, category, issuesByCategoryPerResource.min.Severity)
}
}
val grouped = ranked.groupBy(_._1)
val resourceIdToRawIssueStatus = grouped.mapValues { _. map {case (_, cat, sev) => cat -> sev}.toMap}
resourceIdToRawIssueStatus.map{ case (rId, statusesByCat) =>
IssueStatus(rId, statusesByCat.get("Hardware"), statusesByCat.get("Network"), statusesByCat.get("Software"))
}
小记,我平时不喜欢用mapValues
因为它其实是一个"view"
我正在尝试 groupBy
基于 ResourceId
& Category
和 return 相应的可用的最高严重性级别。
严重性等级为严重 > 主要 > 次要。即按 ResourceId
和 Category
分组后,我们需要 return 该组的最高严重性。
case class Issue(
resourceId: String,
Category: String,
Severity: String,
incidentType: String
)
case class IssueStatus(
resourceId:String,
Hardware: Option[String],
Network: Option[String],
Software: Option[String]
)
List(
Issue("r1", "Network", "Critical", "incident1"),
Issue("r1", "Network", "Major", "incident2"),
Issue("r1", "Hardware", "Minor", "incident 3"),
Issue("r2", "Hardware", "Major", "incident 3"),
Issue("r3", "Software", "Minor", "incident 1"),
)
预期输出:
List(
IssueStatus("r1", Some("Minor"), Some("Critical"), None),
IssueStatus("r2", Some("Major"), None, None),
IssueStatus("r3", None, None, Some("Minor"))
)
更新:
类别映射到案例对象。即我们只有 3 个类别:网络、硬件和软件。
对于每个资源,我想知道每个类别中的最高严重性是什么。如果网络类别的严重性最高为严重,并且没有用于资源 r5
的软件和硬件类别的条目,则相应的 IssueStatus
将类似于
IssueStatus("r5", None, Some("Critical"), None)
case class Issue(
resourceId: String,
Category: String,
Severity: String,
incidentType: String
)
case class IssueStatus(
resourceId: String,
Hardware: Option[String],
Network: Option[String],
Software: Option[String]
)
val p = List(
Issue("r1", "Network", "Critical", "incident1"),
Issue("r1", "Hardware", "Minor", "incident 3"),
Issue("r2", "Hardware", "Major", "incident 3"),
Issue("r3", "Software", "Minor", "incident 1")
)
def getIssues(lstOfIssue: List[Issue], typeOfIssue: String): Option[String] = {
lstOfIssue.find(_.Category == typeOfIssue) match {
case Some(v) => Some(v.Severity)
case _ => None
}
}
def computeIssueStatus(listOfIssues: List[Issue]): List[IssueStatus] = {
listOfIssues.groupBy(issue => issue.resourceId)
.map(kv =>
IssueStatus(kv._1, getIssues(kv._2, "Hardware"), getIssues(kv._2, "Network"), getIssues(kv._2, "Software")))
.toList
}
computeIssueStatus(p)
这是我对 "issue" 的看法。
val input = List(
Issue("r1", "Network", "Critical", "incident1"),
Issue("r1", "Network", "Major", "incident2"),
Issue("r1", "Hardware", "Minor", "incident 3"),
Issue("r2", "Hardware", "Major", "incident 3"),
Issue("r3", "Software", "Minor", "incident 1"),
Issue("r3", "Software", "Critical", "incident 1"), // added 2 more for testing
Issue("r3", "Software", "Major", "incident 1"),
)
val res = input.groupBy(_.resourceId)
.mapValues(_.groupBy(_.Category)
.mapValues(_.map(_.Severity).min))
.map{ case (k,m) =>
IssueStatus(k, m.get("Hardware"), m.get("Network"), m.get("Software"))
}.toList
//res: List[IssueStatus] = List(IssueStatus(r3,None,None,Some(Critical))
// , IssueStatus(r2,Some(Major),None,None)
// , IssueStatus(r1,Some(Minor),Some(Critical),None))
注意:有一个不幸的小技巧,它依赖于 "Critical"、"Major" 和 "Minor" 的字母顺序,较早的优先于后者。如果 Severity
字符串是 "Bad"、"Very Bad" 和 "Doomed".
我相信这符合您的要求:
def highestIssueStatus(issues: List[Issue]): IssueStatus = {
def issueRank(issue: Issue): Int =
List("Minor", "Major", "Critical").indexOf(issue.Severity)
val high = issues
.groupBy(_.Category)
.mapValues(_.maxBy(issueRank).Severity)
IssueStatus(
issues.head.resourceId,
high.get("Hardware"),
high.get("Network"),
high.get("Software")
)
}
list.groupBy(_.resourceId).values.map(highestIssueStatus)
更新
感谢 Yaneeve 指出原文中的错误(issueRank
看起来是 _.Category
而不是 _.Severity
)
优化
根据 OP 的评论,这里有一个针对此问题的更优化但功能更少的解决方案。它一次性将答案构建到可变映射中,而不是使用 groupBy
然后处理结果。
val categories = Vector("Hardware", "Network", "Software")
val severities = Vector("Minor", "Major", "Critical")
val results = Vector(None) ++ severities.map(Some(_))
def parseIssues(issues: List[Issue]) = {
val issueMap = mutable.Map.empty[String, ArrayBuffer[Int]]
issues.foreach{ issue =>
val cat = categories.indexOf(issue.Category) + 1
val sev = severities.indexOf(issue.Severity) + 1
val cur = issueMap.get(issue.resourceId) match {
case Some(v) => v
case None =>
val n = ArrayBuffer(0, 0, 0, 0)
issueMap(issue.resourceId) = n
n
}
if (cur(cat) < sev) {
cur(cat) = sev
}
}
issueMap.map{ case (k, v) =>
IssueStatus(k, results(v(1)), results(v(2)), results(v(3)))
}
}
另一个优化是使用标量值而不是 String
作为类别和严重性。这将避免在主循环中调用 indexOf
并允许 mutable.Map
直接存储 Option[Severity]
而不是作为 results
.
这种方法也可以用于流式传输模式,其中状态更新在进入时不断添加到 Map
,并且可以随时提取最新状态。映射值是可变的,因此当问题被清除时,资源的状态可以重置为 0
(None
)。此处需要考虑线程安全问题,因此可以将其放置在 Akka Actor
.
我即将完成最后一步。仍在努力根据 resourceId 合并 IssueStatus。检查这个。
scala> case class Issue(
| resourceId: String,
| Category: String,
| Severity: String,
| incidentType: String
| )
defined class Issue
scala> case class IssueStatus(
| resourceId:String,
| Hardware: Option[String],
| Network: Option[String],
| Software: Option[String]
| )
defined class IssueStatus
scala>
scala> val issueList = List(
| Issue("r1", "Network", "Critical", "incident1"),
| Issue("r1", "Network", "Major", "incident2"),
| Issue("r1", "Hardware", "Minor", "incident 3"),
| Issue("r2", "Hardware", "Major", "incident 3"),
| Issue("r3", "Software", "Minor", "incident 1")
| )
issueList: List[Issue] = List(Issue(r1,Network,Critical,incident1), Issue(r1,Network,Major,incident2), Issue(r1,Hardware,Minor,incident 3), Issue(r2,Hardware,Major,incident 3), Issue(r3,Software,Minor,incident 1))
scala> val proc1 = issueList.groupBy( x=> (x.resourceId,x.Category)).map( x=>(x._1,(x._2).sortWith( (p,q) => p.Category > q.Category)(0))).map( x=> (x._1._1,x._1._2,x._2.Severity))
proc1: scala.collection.immutable.Iterable[(String, String, String)] = List((r1,Hardware,Minor), (r3,Software,Minor), (r2,Hardware,Major), (r1,Network,Critical))
scala> val proc2 = proc1.map( x => x match { case(a,"Hardware",c) => IssueStatus(a,Some(c),None,None) case(a,"Network",c) => IssueStatus(a,None,Some(c),None) case(a,"Software",c) => IssueStatus(a,None,None,Some(c)) } )
proc2: scala.collection.immutable.Iterable[IssueStatus] = List(IssueStatus(r1,Some(Minor),None,None), IssueStatus(r3,None,None,Some(Minor)), IssueStatus(r2,Some(Major),None,None), IssueStatus(r1,None,Some(Critical),None))
scala>
scala> proc2.foreach(println)
IssueStatus(r1,Some(Minor),None,None)
IssueStatus(r3,None,None,Some(Minor))
IssueStatus(r2,Some(Major),None,None)
IssueStatus(r1,None,Some(Critical),None)
scala>
再考虑一下解决方案:)
val input = List(
Issue("r1", "Network", "Critical", "incident1"),
Issue("r1", "Network", "Major", "incident2"),
Issue("r1", "Hardware", "Major", "incident5"),
Issue("r1", "Hardware", "Minor", "incident 3"),
Issue("r2", "Hardware", "Major", "incident 6"),
Issue("r2", "Hardware", "Critical", "incident 13"),
Issue("r3", "Software", "Minor", "incident 1"),
Issue("r3", "Network", "Major", "incident 1"),
)
val ranked = input.groupBy(_.resourceId).flatMap {case (resourceId, issuesByResource) =>
issuesByResource.groupBy(_.Category). map { case (category, issuesByCategoryPerResource) =>
implicit val _ : Ordering[Issue] = (lhs: Issue, rhs: Issue) => {
(lhs.Severity, rhs.Severity) match {
case ("Critical", _) => -1
case (_, "Critical") => 1
case ("Major", _) => -1
case (_, "Major") => 1
case _ => -1
}
}
(resourceId, category, issuesByCategoryPerResource.min.Severity)
}
}
val grouped = ranked.groupBy(_._1)
val resourceIdToRawIssueStatus = grouped.mapValues { _. map {case (_, cat, sev) => cat -> sev}.toMap}
resourceIdToRawIssueStatus.map{ case (rId, statusesByCat) =>
IssueStatus(rId, statusesByCat.get("Hardware"), statusesByCat.get("Network"), statusesByCat.get("Software"))
}
小记,我平时不喜欢用mapValues
因为它其实是一个"view"