Best way to get null counts, min and max values of multiple (100+) columns from a pyspark dataframe
Suppose I have a list of column names, all of which exist in the dataframe:

cols = ['A', 'B', 'C', 'D']

I'm looking for a quick way to get a table/dataframe like:

   NA_counts  min  max
A          5    0  100
B         10    0  120
C          8    1   99
D          2    0  500
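For reference, a minimal reproducible setup (assuming a SparkSession named spark; the values are invented and serve only to make the expected shape concrete):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data; None marks the nulls to be counted per column
cols = ['A', 'B', 'C', 'D']
df = spark.createDataFrame(
    [(1, 0, None, 2010), (9, 5, 'Test', None), (None, 3, 'Test', 2017)],
    cols,
)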
TIA
You can compute each metric separately and then union them all, like this:
from pyspark.sql import functions as F

# One aggregate expression per column and metric; each select below is a
# global aggregation yielding a single labelled row
nulls_cols = [F.sum(F.when(F.col(c).isNull(), 1).otherwise(0)).alias(c) for c in cols]
max_cols = [F.max(F.col(c)).alias(c) for c in cols]
min_cols = [F.min(F.col(c)).alias(c) for c in cols]
nulls_df = df.select(F.lit("NA_counts").alias("count"), *nulls_cols)
max_df = df.select(F.lit("Max").alias("count"), *max_cols)
min_df = df.select(F.lit("Min").alias("count"), *min_cols)

# Stack the three single-row dataframes (union matches columns by position)
nulls_df.unionAll(max_df).unionAll(min_df).show()
Sample output:
+---------+---+---+----+----+
| count| A| B| C| D|
+---------+---+---+----+----+
|NA_counts| 1| 0| 3| 1|
| Max| 9| 5|Test|2017|
| Min| 1| 0|Test|2010|
+---------+---+---+----+----+
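If you'd rather not make three separate passes over the data, the same statistics can be computed in a single aggregation and reshaped on the driver. A minimal sketch, not part of the answer above: the _na/_min/_max aliases and the final reshape are my own choices, and min/max are cast to strings so columns of mixed types fit one output column:

from pyspark.sql import functions as F

# Single pass: three aggregates per column, flattened into one wide row
agg_exprs = []
for c in cols:
    agg_exprs += [
        F.sum(F.when(F.col(c).isNull(), 1).otherwise(0)).alias(c + "_na"),
        F.min(c).alias(c + "_min"),
        F.max(c).alias(c + "_max"),
    ]
row = df.agg(*agg_exprs).first().asDict()

# Reshape into one row per column, matching the layout asked for above;
# str() keeps mixed-type min/max values in a single string column
stats = [(c, row[c + "_na"], str(row[c + "_min"]), str(row[c + "_max"])) for c in cols]
spark.createDataFrame(stats, ["column", "NA_counts", "min", "max"]).show()

Spark also ships df.summary("count", "min", "max"), which returns min/max (as strings) plus per-column non-null counts in one call; null counts are then the total row count minus each column's count.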