Scala: getting read, write and reject records with a simplified regex
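I'm processing a log file with Scala to parse the read/written/rejected record counts and convert them into a Map. The values appear on separate lines: "read", followed on the next line by "written", and then "rejected".
The code snippet I'm using is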
val log_text =
"""
|server.net|Wed Apr 8 05:44:24 2018|acct_reformat.000||finish|
| 120 records ( 7200 bytes) read
| 100 records ( 6000 bytes) written
| 20 records ( 1200 bytes) rejected|
|server.net|Wed Apr 8 05:44:24 2018|acct_reformat_rfm_logs
""".stripMargin
val read_pat = """(\d+) (records) (.*)""".r
val write_pat = """(?s)records .*? (\d+) (records)(.*)""".r
val reject_pat = """(?s).* (\d+) (records)""".r
val read_recs = read_pat.findAllIn(log_text).matchData.map( m=> m.subgroups(0) ).take(1).mkString
val write_recs = write_pat.findAllIn(log_text).matchData.map( m=> m.subgroups(0) ).take(1).mkString
val reject_recs = reject_pat.findAllIn(log_text).matchData.map( m=> m.subgroups(0) ).take(1).mkString
val log_summ = List("Read",read_recs,"Write",write_recs,"Reject",reject_recs).sliding(2,2).map( p => p match { case List(x,y) => (x,y)}).toMap
The result is
log_summ: scala.collection.immutable.Map[String,String] = Map(Read -> 120, Write -> 100, Reject -> 20)
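Somehow I feel I'm doing this in a roundabout/redundant way. Is there a better way to accomplish this?
Looks fine to me. There are only three things to improve:
1) IntelliJ is your friend. It immediately gives you two intentions: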
m.subgroups(0)
-> m.subgroups.head
map(p => p match { case List(x, y) => (x, y) })
-> map { case List(x, y) => (x, y) }
2) DRY. Don't repeat the read/write/reject-related code three times. Just put it in one place, once. For example:
import scala.util.matching.Regex
case class Processor(name: String, patternString: String) {
lazy val pattern: Regex = patternString.r
}
val processors = Seq(
Processor("Read", """(\d+) (records) (.*)"""),
Processor("Write", """(?s)records .*? (\d+) (records)(.*)"""),
Processor("Reject", """(?s).* (\d+) (records)"""),
)
def read_recs(processor: Processor) = processor.pattern.findAllIn(log_text).matchData.map(m => m.subgroups.head).take(1).mkString
3) A List[Tuple2] can be converted to a Map with a simple toMap:
val log_summ = processors.map(processor => processor.name -> read_recs(processor)).toMap
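Putting the three points together, a minimal self-contained sketch might look like the following. The names `ProcessorSummary`, `firstCount` and `summarize` are made up for illustration, and `findFirstMatchIn` is swapped in for the `findAllIn(...).matchData...take(1).mkString` chain; it should behave the same here, returning an empty string when nothing matches.

```scala
import scala.util.matching.Regex

object ProcessorSummary {
  // One entry per value to extract: a label plus the regex that finds it.
  final case class Processor(name: String, patternString: String) {
    lazy val pattern: Regex = patternString.r
  }

  // The three patterns are taken unchanged from the question.
  val processors: Seq[Processor] = Seq(
    Processor("Read", """(\d+) (records) (.*)"""),
    Processor("Write", """(?s)records .*? (\d+) (records)(.*)"""),
    Processor("Reject", """(?s).* (\d+) (records)""")
  )

  // Hypothetical helper: first capture group of the first match, or "" if none.
  def firstCount(processor: Processor, logText: String): String =
    processor.pattern.findFirstMatchIn(logText).map(_.group(1)).getOrElse("")

  // Hypothetical helper: builds the label -> count map for a given log text.
  def summarize(logText: String): Map[String, String] =
    processors.map(p => p.name -> firstCount(p, logText)).toMap
}
```

Calling `ProcessorSummary.summarize(log_text)` with the log from the question should give the same three entries as the original code.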
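Given the similarity of the read/write/reject text, you could simplify the multiple Regex patterns into a single generic pattern and use zip to generate your Map, as shown below: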
val pattern = """(\d+) records .*""".r
val keys = List("Read", "Write", "Reject")
val values = pattern.findAllIn(log_text).matchData.map(_.subgroups(0)).toList
// values: List[String] = List(120, 100, 20)
val log_summ = (keys zip values).toMap
// log_summ: scala.collection.immutable.Map[String,String] =
// Map(Read -> 120, Write -> 100, Reject -> 20)
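One thing to keep in mind is that zip pairs keys and values purely by position and truncates to the shorter list, so a log with a missing or extra "records" line would quietly produce an incomplete Map. A possible guard, sketched under the assumption that exactly three matching lines are expected (the names `ZipSummary` and `summarize` are hypothetical, and the counts are converted to Int here):

```scala
object ZipSummary {
  // Same general pattern as above: capture the count that precedes " records".
  private val pattern = """(\d+) records .*""".r
  private val keys = List("Read", "Write", "Reject")

  // Returns None when the number of matches differs from the number of keys,
  // because zip would otherwise silently drop values or leave keys missing.
  def summarize(logText: String): Option[Map[String, Int]] = {
    val counts = pattern.findAllMatchIn(logText).map(_.group(1).toInt).toList
    if (counts.length == keys.length) Some(keys.zip(counts).toMap) else None
  }
}
```

Whether to return an Option, throw, or fall back to defaults depends on how strict you want the parsing to be.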
If you're willing to use the log's own wording for the Map keys, it can be done in a single pass.
val Pattern = raw"(\d+) records .*\) ([^|]+)".r.unanchored
log_text.split("\n").flatMap{
case Pattern(num, typ) => Some(typ -> num)
case _ => None
}.toMap
//res0: immutable.Map[String,String] = Map(read -> 120, written -> 100, rejected -> 20)
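If the original key names are still wanted, the single-pass idea can be combined with a small lookup table, which also removes the dependence on line order. This is only a sketch: `LabelledSummary`, `labels` and `summarize` are hypothetical names, `collect` replaces the `flatMap` with `Some`/`None`, and the counts are converted to Int.

```scala
object LabelledSummary {
  // Hypothetical lookup from the log's wording to the key names in the question.
  private val labels: Map[String, String] =
    Map("read" -> "Read", "written" -> "Write", "rejected" -> "Reject")

  // Group 1 is the record count, group 2 the word after the closing parenthesis.
  private val Line = raw"(\d+) records .*\) (\w+)".r.unanchored

  // Keeps only lines that match the pattern and whose word is a known label.
  def summarize(logText: String): Map[String, Int] =
    logText.linesIterator.collect {
      case Line(num, word) if labels.contains(word) => labels(word) -> num.toInt
    }.toMap
}
```

Lines whose trailing word is not in `labels` are skipped rather than reported, which may or may not be what you want for unexpected log content.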