Replacing Spark Dataset column values randomly from a set
I have a Dataset, imputedcsv, and I want to randomly replace the null values in its Gender column with either Male or Female.
imputedcsv.groupBy("Gender").count.show()
+------+-----+
|Gender|count|
+------+-----+
|  null|   24|
|Female|  240|
|  Male|  242|
+------+-----+
I can fill the nulls with a single value, but how do I fill a column's nulls randomly from a set of values, say {Male, Female}?
imputedcsv.na.fill("Male", Seq("Gender")).groupBy("Gender").count.show()
+------+-----+
|Gender|count|
+------+-----+
|Female|  240|
|  Male|  266|
+------+-----+
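I need to fill them randomly with Male or Female rather than replacing every null with the single value Male.
Something like R's sample(c('Male','Female')); for a single value we already have na.fill, as shown above.
Any help is appreciated.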
If you assume that Female and Male are equally likely, you can do this:
df.withColumn("gender",
  coalesce($"gender",
    when(round(rand).cast("int") === lit(0), lit("Male"))
      .otherwise(lit("Female"))
  )).show
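coalesce makes the replacement apply only to null values; non-null values pass through unchanged.
round(rand).cast("int") yields either 0 or 1 for each row, and the when - otherwise construct then picks Male or Female accordingly.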
You can do it with when & otherwise and withColumn, as shown below:
scala> df.groupBy("Gender").count.show
+------+-----+
|Gender|count|
+------+-----+
|  null|    2|
|female|    4|
|  male|    4|
+------+-----+
scala> df.withColumn("gender", when(($"gender".isNull), "male").otherwise($"gender")).groupBy("gender").count.show
+------+-----+
|gender|count|
+------+-----+
|female|    4|
|  male|    6|
+------+-----+
I missed the "randomly" part; you can do it like this:
scala> val gender_set = Set("male","female")
gender_set: scala.collection.immutable.Set[String] = Set(male, female)
scala> import scala.util.Random
import scala.util.Random
scala> val rnd=new Random
rnd: scala.util.Random = scala.util.Random@668b5a55
scala> df.withColumn("gender", when(($"gender".isNull), gender_set.toVector(rnd.nextInt(gender_set.size))).otherwise($"gender")).groupBy("gender").count.show
+------+-----+
|gender|count|
+------+-----+
|female|    4|
|  male|    6|
+------+-----+
scala> df.withColumn("gender", when(($"gender".isNull), gender_set.toVector(rnd.nextInt(gender_set.size))).otherwise($"gender")).groupBy("gender").count.show
+------+-----+
|gender|count|
+------+-----+
|female|    6|
|  male|    4|
+------+-----+
Thanks.
I had to put @Learner's code into a UDF to get it to work; otherwise it raises an error.
df.groupBy($"Gender").count.show()
+------+-----+
|Gender|count|
+------+-----+
|  null|    3|
|Female|    3|
|  Male|    2|
+------+-----+
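import org.apache.spark.sql.functions.{udf, when}
import scala.util.Random
val rnd = new Random  // same generator as in @Learner's answer
val gender_set = Set("Male","Female")
// Wrapping the draw in a UDF defers it to execution time instead of evaluating it once on the driver
val randGenderUDF = udf(() =>
  gender_set.toVector(rnd.nextInt(gender_set.size))
)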
df.withColumn("Gender", when($"Gender".isNull, randGenderUDF()).otherwise($"Gender")).groupBy($"Gender").count.show()
+------+-----+
|Gender|count|
+------+-----+
|Female|    5|
|  Male|    3|
+------+-----+
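A note on why the UDF matters: in the non-UDF snippet, gender_set.toVector(rnd.nextInt(gender_set.size)) is an ordinary Scala expression, so it is evaluated once on the driver when the column expression is built and every null in that run receives the same value. That is consistent with the 4/6 and 6/4 counts shown for it. Below is a minimal sketch of the UDF variant, assuming Spark 2.3 or later (for asNondeterministic) and a local SparkSession; the names people, randGender, and imputed and the sample rows are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, udf}
import scala.util.Random

// Assumed local session; in spark-shell the existing `spark` can be used instead
val spark = SparkSession.builder().master("local[*]").appName("random-fill-udf").getOrCreate()
import spark.implicits._

// Illustrative data, not the asker's dataset
val people = Seq(Some("Male"), None, Some("Female"), None, None).toDF("Gender")

val genders = Vector("Male", "Female")

// The UDF body draws a value at execution time; asNondeterministic() tells
// the optimizer that repeated calls may return different results
val randGender = udf(() => genders(Random.nextInt(genders.size))).asNondeterministic()

// coalesce keeps existing values and falls back to the random draw for nulls
val imputed = people.withColumn("Gender", coalesce($"Gender", randGender()))
imputed.groupBy("Gender").count.show()
```

Compared with the built-in rand approach, a UDF is opaque to the optimizer, so the rand-based version is probably the better default when the choice only needs to be uniform.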