Python - Move rows to a new DataFrame as bad records if DOB, Address1, Address2 and PostCode are all NULL

Question

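I am trying to move the rows that have NULL values in all four columns (DOB, Address1, Address2 and PostCode) into a new DataFrame, and keep the original dataset with only the clean records.

I tried to solve it with the code below: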

import numpy as np
import pandas as pd
BadRecords = Data.dropna(subset=['DOB','Address1','Address2','PostCode'], how='any') 
print(BadRecords)

The current code prints the entire dataset. It should return only the records where all four of DOB, Address1, Address2 and PostCode are NULL.

Answer 1

To get the records where all four columns are null, you can filter the original dataset like this:

from pyspark.sql.functions import col, isnull
badRecords = Data.filter(isnull(col('DOB')) & isnull(col('Address1')) & isnull(col('Address2')) & isnull(col('PostCode')))
display(badRecords)

The dropna function returns a new DataFrame with the matching null rows omitted, so it can only give you the "good" records:

goodRecords = Data.dropna(subset=['DOB','Address1','Address2','PostCode'], how='all')

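Also note that how='any' drops a row when at least one of the listed columns is null, so if you want to drop a row only when all of those columns are null, you need the 'all' setting.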
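The question imports pandas, so in case Data is actually a pandas DataFrame rather than a Spark one, the same split can be sketched as below. This is only an illustration under that assumption: the sample rows, the Name column, and the variable names cols, all_null, bad_records and good_records are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the four column names follow the question.
Data = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid"],
    "DOB": ["1990-01-01", np.nan, np.nan],
    "Address1": ["1 High St", np.nan, "5 Low Rd"],
    "Address2": ["Flat 2", np.nan, np.nan],
    "PostCode": ["AB1 2CD", np.nan, np.nan],
})

cols = ["DOB", "Address1", "Address2", "PostCode"]

# True only for rows where every one of the four columns is null.
all_null = Data[cols].isna().all(axis=1)

bad_records = Data[all_null]     # rows to set aside (Bob)
good_records = Data[~all_null]   # same rows as Data.dropna(subset=cols, how="all")

# how="any" is stricter: it also drops Cid, who has only some nulls.
strict_records = Data.dropna(subset=cols, how="any")

print(bad_records["Name"].tolist())
print(good_records["Name"].tolist())
print(strict_records["Name"].tolist())
```

Building the boolean mask once and using it and its negation keeps the two result sets complementary, so every row ends up in exactly one of them.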

python-2.7

apache-spark-sql

azure-databricks