Spark SQL filtering (selecting with where clause) with multiple conditions
Hi, I have the following issue:

numeric.registerTempTable("numeric")

All the values that I want to filter on are literal 'null' strings, not N/A or real NULL values. I tried these three options:
numeric_filtered = numeric.filter(numeric['LOW'] != 'null').filter(numeric['HIGH'] != 'null').filter(numeric['NORMAL'] != 'null')
numeric_filtered = numeric.filter(numeric['LOW'] != 'null' AND numeric['HIGH'] != 'null' AND numeric['NORMAL'] != 'null')
sqlContext.sql("SELECT * from numeric WHERE LOW != 'null' AND HIGH != 'null' AND NORMAL != 'null'")
Unfortunately, numeric_filtered is always empty. I double-checked, and numeric does contain rows that should match these conditions.

Here are some sample values:
Low    High   Normal
3.5    5.0    null
2.0    14.0   null
null   38.0   null
null   null   null
1.0    null   4.0
You are using logical conjunction (AND). It means that all columns have to be different from 'null' for a row to be included. Let's illustrate that using the filter version as an example:
numeric = sqlContext.createDataFrame([
    ('3.5', '5.0', 'null'), ('2.0', '14.0', 'null'), ('null', '38.0', 'null'),
    ('null', 'null', 'null'), ('1.0', 'null', '4.0')],
    ('low', 'high', 'normal'))
numeric_filtered_1 = numeric.where(numeric['LOW'] != 'null')
numeric_filtered_1.show()
## +----+----+------+
## | low|high|normal|
## +----+----+------+
## | 3.5| 5.0| null|
## | 2.0|14.0| null|
## | 1.0|null| 4.0|
## +----+----+------+
numeric_filtered_2 = numeric_filtered_1.where(
    numeric_filtered_1['NORMAL'] != 'null')
numeric_filtered_2.show()
## +---+----+------+
## |low|high|normal|
## +---+----+------+
## |1.0|null| 4.0|
## +---+----+------+
numeric_filtered_3 = numeric_filtered_2.where(
    numeric_filtered_2['HIGH'] != 'null')
numeric_filtered_3.show()
## +---+----+------+
## |low|high|normal|
## +---+----+------+
## +---+----+------+
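For comparison, a minimal sketch (my addition, not from the answer itself) showing that the three chained filters collapse into a single AND condition, written with the & column operator; the parentheses around each comparison are required by Python's operator precedence:

# One-step equivalent of the three chained .where calls above; the
# conjunction is just as empty as numeric_filtered_3.
conjunction = numeric.where(
    (numeric['LOW'] != 'null') &
    (numeric['NORMAL'] != 'null') &
    (numeric['HIGH'] != 'null'))
conjunction.show()
## +---+----+------+
## |low|high|normal|
## +---+----+------+
## +---+----+------+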
All the remaining methods you've tried follow exactly the same pattern. What you need here is logical disjunction (OR).
from pyspark.sql.functions import col
numeric_filtered = numeric.where(
    (col('LOW') != 'null') |
    (col('NORMAL') != 'null') |
    (col('HIGH') != 'null'))
numeric_filtered.show()
## +----+----+------+
## | low|high|normal|
## +----+----+------+
## | 3.5| 5.0| null|
## | 2.0|14.0| null|
## |null|38.0| null|
## | 1.0|null| 4.0|
## +----+----+------+
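With more columns, the same disjunction can be built programmatically instead of being spelled out by hand. A short sketch, assuming the columns of interest are LOW, NORMAL, and HIGH:

from functools import reduce
from operator import or_
from pyspark.sql.functions import col

# OR together one (col != 'null') predicate per column of interest.
cols = ['LOW', 'NORMAL', 'HIGH']
condition = reduce(or_, [col(c) != 'null' for c in cols])
numeric.where(condition).show()  # same result as numeric_filtered above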
Or with raw SQL:
numeric.registerTempTable("numeric")
sqlContext.sql("""SELECT * FROM numeric
WHERE low != 'null' OR normal != 'null' OR high != 'null'"""
).show()
## +----+----+------+
## | low|high|normal|
## +----+----+------+
## | 3.5| 5.0| null|
## | 2.0|14.0| null|
## |null|38.0| null|
## | 1.0|null| 4.0|
## +----+----+------+
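As a side note, here is a sketch (an addition, not part of the answer) that turns the literal 'null' strings into real NULLs, after which ordinary NULL semantics such as IS NOT NULL and Column.isNotNull() apply:

from pyspark.sql.functions import col, when

# when() without otherwise() yields NULL where the condition is false,
# so every literal 'null' string becomes a real NULL.
cleaned = numeric.select(
    [when(col(c) != 'null', col(c)).alias(c) for c in numeric.columns])
cleaned.show()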
See also:
from pyspark.sql.functions import col, countDistinct
totalrecordcount = df.where("ColumnName is not null").select(countDistinct("ColumnName")).collect()[0][0]
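The snippet above counts the distinct non-NULL values of a column; ColumnName and df are placeholders. Against the literal-'null'-string data from this question, the comparison would target the string instead. A sketch using the numeric DataFrame:

from pyspark.sql.functions import col, countDistinct

# Distinct values of 'low' that are not the literal string 'null'.
totalrecordcount = (numeric.where(col('low') != 'null')
                           .select(countDistinct('low'))
                           .collect()[0][0])
print(totalrecordcount)  ## 3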