What is the best PySpark practice to subtract two string columns within a single spark dataframe?

Suppose I have a Spark dataframe like the following:

+---------------------------------------------------+-----------------+----------------------------------+
|data                                               |A                |Expected_column = data - A        |
+---------------------------------------------------+-----------------+----------------------------------+
|https://example1.org/path/to/file?param=42#fragment|param=42#fragment|https://example1.org/path/to/file?|
|https://example2.org/path/to/file                  |NaN              |https://example2.org/path/to/file |
+---------------------------------------------------+-----------------+----------------------------------+
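
For reference, here is a minimal sketch that builds this example dataframe (the SparkSession setup and the None standing in for the NaN cell are my assumptions, not part of the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows matching the table above; None plays the role of the NaN cell.
sdf = spark.createDataFrame(
    [
        ("https://example1.org/path/to/file?param=42#fragment", "param=42#fragment"),
        ("https://example2.org/path/to/file", None),
    ],
    ["data", "A"],
)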

I was wondering whether there is a proper mechanism to subtract the two string columns from each other, e.g.:

sdf1 = sdf.withColumn('Expected_column', sdf['data'] - sdf['A'])

This returns null for every row of the Expected_column column, because the arithmetic - operator implicitly casts both string columns to a numeric type, and that cast fails on non-numeric text. I checked different solutions like this one, but they deal with two dataframes, while my case is within a single dataframe; their issues also do not involve string columns. The closest question I found was about a different setup, which again is not my case.
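
For what it's worth, the implicit cast can be made visible directly (a minimal sketch, assuming the sdf dataframe from above):

# The minus operator casts both sides to double; the URLs are not numeric,
# so the cast yields null and the subtraction returns null for every row.
sdf.select(
    (sdf['data'] - sdf['A']).alias('Expected_column'),
    sdf['data'].cast('double').alias('data_as_double'),  # null for URL strings
).show(truncate=False)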

The function you are looking for is called replace:

from pyspark.sql import functions as F

sdf.withColumn("data - A", F.expr("replace(data, coalesce(A, ''), '')")).show(
    truncate=False
)
+---------------------------------------------------+-----------------+----------------------------------+
|data                                               |A                |data - A                          |
+---------------------------------------------------+-----------------+----------------------------------+
|https://example1.org/path/to/file?param=42#fragment|param=42#fragment|https://example1.org/path/to/file?|
|https://example2.org/path/to/file                  |null             |https://example2.org/path/to/file |
+---------------------------------------------------+-----------------+----------------------------------+
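
A side note: replace removes every occurrence of the search string, not just the first one. Also, from PySpark 3.5 onwards the same SQL function is exposed directly in pyspark.sql.functions, so the expr string can be avoided (a minimal sketch, assuming PySpark >= 3.5):

from pyspark.sql import functions as F

# Column-API equivalent of the expr version above.
sdf.withColumn(
    "data - A",
    F.replace("data", F.coalesce("A", F.lit("")), F.lit("")),
).show(truncate=False)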