Create a DataFrame with null values in some columns
I am trying to create a DataFrame from an RDD. First, I create an RDD with the following code:
val account = sc.parallelize(Seq(
  (1, null, 2, "F"),
  (2, 2, 4, "F"),
  (3, 3, 6, "N"),
  (4, null, 8, "F")))
That works fine:
account: org.apache.spark.rdd.RDD[(Int, Any, Int, String)] = ParallelCollectionRDD[0] at parallelize at :27
But when I try to create a DataFrame from the RDD with
account.toDF("ACCT_ID", "M_CD", "C_CD","IND")
I get the error below:
java.lang.UnsupportedOperationException: Schema for type Any is not supported
From what I can tell, the error occurs only when I put null values in the Seq. Is there a way to include null values?
The problem is that Any is too general a type and Spark has no idea how to serialize it. You should explicitly provide a specific type, in your case Integer. Since null cannot be assigned to primitive types in Scala, you can use java.lang.Integer instead. So try this:
val account = sc.parallelize(Seq(
  (1, null.asInstanceOf[Integer], 2, "F"),
  (2, new Integer(2), 4, "F"),
  (3, new Integer(3), 6, "N"),
  (4, null.asInstanceOf[Integer], 8, "F")))
Here is the output:
rdd: org.apache.spark.rdd.RDD[(Int, Integer, Int, String)] = ParallelCollectionRDD[0] at parallelize at <console>:24
And the corresponding DataFrame:
scala> val df = rdd.toDF("ACCT_ID", "M_CD", "C_CD","IND")
df: org.apache.spark.sql.DataFrame = [ACCT_ID: int, M_CD: int ... 2 more fields]
scala> df.show
+-------+----+----+---+
|ACCT_ID|M_CD|C_CD|IND|
+-------+----+----+---+
| 1|null| 2| F|
| 2| 2| 4| F|
| 3| 3| 6| N|
| 4|null| 8| F|
+-------+----+----+---+
You could also consider a cleaner way to declare the null Integer value, e.g.:
object Constants {
  val NullInteger: java.lang.Integer = null
}
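With that constant in place, the Seq can be written without repeating the `null.asInstanceOf[Integer]` cast. A minimal sketch, reusing the `sc` and column layout from above (`Integer.valueOf` is used here instead of the deprecated `new Integer` constructor):

```scala
val account = sc.parallelize(Seq(
  (1, Constants.NullInteger, 2, "F"),
  (2, Integer.valueOf(2), 4, "F"),
  (3, Integer.valueOf(3), 6, "N"),
  (4, Constants.NullInteger, 8, "F")))
// The second tuple element is now inferred as java.lang.Integer,
// so toDF("ACCT_ID", "M_CD", "C_CD", "IND") works as before.
```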
An alternative that does not use an RDD:
import spark.implicits._

val df = spark.createDataFrame(Seq(
  (1, None, 2, "F"),
  (2, Some(2), 4, "F"),
  (3, Some(3), 6, "N"),
  (4, None, 8, "F")
)).toDF("ACCT_ID", "M_CD", "C_CD", "IND")
df.show
+-------+----+----+---+
|ACCT_ID|M_CD|C_CD|IND|
+-------+----+----+---+
| 1|null| 2| F|
| 2| 2| 4| F|
| 3| 3| 6| N|
| 4|null| 8| F|
+-------+----+----+---+
df.printSchema
root
|-- ACCT_ID: integer (nullable = false)
|-- M_CD: integer (nullable = true)
|-- C_CD: integer (nullable = false)
|-- IND: string (nullable = true)
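If you want full control over column types and nullability, another common approach (standard Spark API, not part of the answers above) is to build the DataFrame from `Row` objects with an explicit schema, where a nullable field accepts `null` directly:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Declare the schema up front; M_CD is marked nullable, so null is allowed there.
val schema = StructType(Seq(
  StructField("ACCT_ID", IntegerType, nullable = false),
  StructField("M_CD", IntegerType, nullable = true),
  StructField("C_CD", IntegerType, nullable = false),
  StructField("IND", StringType, nullable = true)))

// Rows are untyped, so null can be placed in any nullable position.
val rows = sc.parallelize(Seq(
  Row(1, null, 2, "F"),
  Row(2, 2, 4, "F"),
  Row(3, 3, 6, "N"),
  Row(4, null, 8, "F")))

val df = spark.createDataFrame(rows, schema)
```

Note that with this variant the schema is exactly what you declared, whereas the tuple-based variants infer nullability from the Scala types.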