Scala - unit testing a function that returns a Column

I have a function isJSON() that takes a Column and returns a Column.

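  import org.apache.spark.sql.Column

  // Heuristic check: true when the value contains both "{" and "}".
  def isJSON( element: Column ): Column = {
    element.contains("{") && element.contains("}")
  }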

This is how I normally use it, and it works as expected:

df.withColumn("is_json", isJSON( col("data") ))

I am trying to write a unit test with FunSpec, but I cannot assert on data of type Column.

describe("isJSON()") {
  it("should return false if data is not JSON") {
    val df = Seq( "Not a JSON" ).toDF( "data" )
    assert( isJSON( df("data") ).equals( lit( false ) ))
  }
}

The unit test fails with the following stack trace:

ScalaTestFailureLocation: com.mhedu.common.datalake.DatalakeFunSpecTest$$anonfun$$anonfun$apply$mcV$sp at (DatalakeFunSpecTest.scala:29)
org.scalatest.exceptions.TestFailedException: datalake.this.`package`.isJSON(df.apply("data")).equals(org.apache.spark.sql.functions.lit(false)) was false
    at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
    at org.scalatest.FunSpec.newAssertionFailedException(FunSpec.scala:1626)
    at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
    at com.mhedu.common.datalake.DatalakeFunSpecTest$$anonfun$$anonfun$apply$mcV$sp.apply$mcV$sp(DatalakeFunSpecTest.scala:29)
    at com.mhedu.common.datalake.DatalakeFunSpecTest$$anonfun$$anonfun$apply$mcV$sp.apply(DatalakeFunSpecTest.scala:23)
    at com.mhedu.common.datalake.DatalakeFunSpecTest$$anonfun$$anonfun$apply$mcV$sp.apply(DatalakeFunSpecTest.scala:23)
    at org.scalatest.Transformer$$anonfun$apply.apply$mcV$sp(Transformer.scala:22)
    at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
    at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
    at org.scalatest.Transformer.apply(Transformer.scala:22)
    at org.scalatest.Transformer.apply(Transformer.scala:20)
    at org.scalatest.FunSpecLike$$anon.apply(FunSpecLike.scala:422)
    at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
    at org.scalatest.FunSpec.withFixture(FunSpec.scala:1626)
    at org.scalatest.FunSpecLike$class.invokeWithFixture(FunSpecLike.scala:419)
    at org.scalatest.FunSpecLike$$anonfun$runTest.apply(FunSpecLike.scala:431)
    at org.scalatest.FunSpecLike$$anonfun$runTest.apply(FunSpecLike.scala:431)
    at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
    at org.scalatest.FunSpecLike$class.runTest(FunSpecLike.scala:431)
    at com.mhedu.common.datalake.DatalakeFunSpecTest.org$scalatest$BeforeAndAfter$$super$runTest(DatalakeFunSpecTest.scala:13)
    at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
    at com.mhedu.common.datalake.DatalakeFunSpecTest.runTest(DatalakeFunSpecTest.scala:13)
    at org.scalatest.FunSpecLike$$anonfun$runTests.apply(FunSpecLike.scala:464)
    at org.scalatest.FunSpecLike$$anonfun$runTests.apply(FunSpecLike.scala:464)
    at org.scalatest.SuperEngine$$anonfun$traverseSubNodes.apply(Engine.scala:413)
    at org.scalatest.SuperEngine$$anonfun$traverseSubNodes.apply(Engine.scala:401)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.scalatest.SuperEngine.traverseSubNodes(Engine.scala:401)
    at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:390)
    at org.scalatest.SuperEngine$$anonfun$traverseSubNodes.apply(Engine.scala:427)
    at org.scalatest.SuperEngine$$anonfun$traverseSubNodes.apply(Engine.scala:401)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.scalatest.SuperEngine.traverseSubNodes(Engine.scala:401)
    at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
    at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
    at org.scalatest.FunSpecLike$class.runTests(FunSpecLike.scala:464)
    at org.scalatest.FunSpec.runTests(FunSpec.scala:1626)
    at org.scalatest.Suite$class.run(Suite.scala:1424)
    at org.scalatest.FunSpec.org$scalatest$FunSpecLike$$super$run(FunSpec.scala:1626)
    at org.scalatest.FunSpecLike$$anonfun$run.apply(FunSpecLike.scala:468)
    at org.scalatest.FunSpecLike$$anonfun$run.apply(FunSpecLike.scala:468)
    at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
    at org.scalatest.FunSpecLike$class.run(FunSpecLike.scala:468)
    at com.mhedu.common.datalake.DatalakeFunSpecTest.org$scalatest$BeforeAndAfter$$super$run(DatalakeFunSpecTest.scala:13)
    at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
    at com.mhedu.common.datalake.DatalakeFunSpecTest.run(DatalakeFunSpecTest.scala:13)
    at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:55)
    at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun.apply(Runner.scala:2563)
    at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun.apply(Runner.scala:2557)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:2557)
    at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter.apply(Runner.scala:1044)
    at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter.apply(Runner.scala:1043)
    at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:2722)
    at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:1043)
    at org.scalatest.tools.Runner$.run(Runner.scala:883)
    at org.scalatest.tools.Runner.run(Runner.scala)
    at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2(ScalaTestRunner.java:138)
    at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:28)

Is there a way to write an assertion for a Column, or to somehow extract the raw values of the boolean column and compare those?

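You are testing the equality of two Column instances. Those instances are not equal: they would produce the same result if applied to your DataFrame, but as objects they are different expressions (you could easily apply them to a different DataFrame and get different results).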
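
To make the point above concrete, here is a minimal sketch. It assumes only that Spark SQL is on the classpath; the name `expr` is made up for illustration. A Column is an unevaluated expression, so comparing it with `lit(false)` compares two expressions, not the values they would produce for any row.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit}

// Same expression that isJSON() builds, written inline (illustrative name).
val expr: Column = col("data").contains("{") && col("data").contains("}")

// Printing a Column shows a description of the expression, not a Boolean.
println(expr)

// equals() compares the expressions themselves; no row is ever evaluated,
// so this is false regardless of what the data contains.
assert(!expr.equals(lit(false)))
```

No DataFrame is involved anywhere in this snippet, which is why the assertion in the question cannot say anything about the string "Not a JSON".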

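One way to test this is to filter the DataFrame on the condition that the two Columns (the result of isJSON and lit(true)) are equal, and then assert that the size of the result is 0: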

describe("isJSON()") {
  it("should return false if data is not JSON") {
    val df = Seq("Not a JSON").toDF( "data" )
    assert(df.filter(isJSON(df("data")) === lit(true)).count() == 0)
  }
}

Another option is to collect the result of computing this column and assert that every value is false, for example:

// Requires: import org.apache.spark.sql.Row
describe("isJSON()") {
  it("should return false if data is not JSON") {
    val df = Seq("Not a JSON").toDF( "data" )
    val results: Array[Boolean] = df.select(isJSON(df("data"))).collect().map { case Row(b: Boolean) => b }
    assert(results sameElements Array(false))
  }
}

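There are many other similar options. The important concept here is to compare the data rather than the Column objects: as long as the values compared in the assertion are themselves Columns, you are not comparing actual results.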
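
As one more variation on the same idea, the sketch below wraps "apply the Column function to some data and collect plain Booleans" in a small helper. Everything here apart from isJSON itself is an assumption for illustration: the class name `IsJsonSpec`, the helper `evaluate`, and the local SparkSession are hypothetical, and `org.scalatest.FunSpec` matches the ScalaTest version seen in the stack trace (newer releases use `org.scalatest.funspec.AnyFunSpec` instead).

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.scalatest.FunSpec

class IsJsonSpec extends FunSpec {
  // Assumption: a local SparkSession created only for this test suite.
  private lazy val spark: SparkSession =
    SparkSession.builder().master("local[1]").appName("isJSON-test").getOrCreate()

  // Copy of the function under test, repeated so the sketch is self-contained.
  def isJSON(element: Column): Column =
    element.contains("{") && element.contains("}")

  // Hypothetical helper: builds a one-column DataFrame from the given strings,
  // applies a Column => Column function, and returns the evaluated Booleans.
  private def evaluate(values: Seq[String], f: Column => Column): Seq[Boolean] = {
    val session = spark
    import session.implicits._
    val df = values.toDF("data")
    df.select(f(df("data"))).as[Boolean].collect().toSeq
  }

  describe("isJSON()") {
    it("should return false if data is not JSON") {
      assert(evaluate(Seq("Not a JSON"), isJSON) == Seq(false))
    }

    it("should return true if data contains both braces") {
      assert(evaluate(Seq("""{"a": 1}"""), isJSON) == Seq(true))
    }
  }
}
```

The assertions now compare `Seq[Boolean]` values, so a failure message shows the actual data rather than two expression descriptions.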