What is the best PySpark practice to subtract two string columns within a single spark dataframe?

Suppose I have a Spark dataframe like the following:

+---------------------------------------------------+-----------------+----------------------------------+
|data                                               |A                |Expected_column = data - A        |
+---------------------------------------------------+-----------------+----------------------------------+
|https://example1.org/path/to/file?param=42#fragment|param=42#fragment|https://example1.org/path/to/file?|
|https://example2.org/path/to/file                  |NaN              |https://example2.org/path/to/file |
+---------------------------------------------------+-----------------+----------------------------------+
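
For reference, here is a minimal sketch that builds this example dataframe (the SparkSession setup and the None standing in for the NaN cell are my assumptions, not part of the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows matching the table above; None plays the role of the NaN cell.
sdf = spark.createDataFrame(
    [
        ("https://example1.org/path/to/file?param=42#fragment", "param=42#fragment"),
        ("https://example2.org/path/to/file", None),
    ],
    ["data", "A"],
)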

I was wondering whether there is a proper mechanism to subtract the two string columns from each other, e.g.:

sdf1 = sdf.withColumn('Expected_column', sdf['data'] - sdf['A'])

This returns null for every row of the Expected_column column, because the arithmetic - operator implicitly casts both string columns to a numeric type, and that cast fails on non-numeric text. I checked different solutions like this one, but they deal with two dataframes, while my case is within a single dataframe; their issues also do not involve string columns. The closest question I found was about a different setup, which again is not my case.
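
For what it's worth, the implicit cast can be made visible directly (a minimal sketch, assuming the sdf dataframe from above):

# The minus operator casts both sides to double; the URLs are not numeric,
# so the cast yields null and the subtraction returns null for every row.
sdf.select(
    (sdf['data'] - sdf['A']).alias('Expected_column'),
    sdf['data'].cast('double').alias('data_as_double'),  # null for URL strings
).show(truncate=False)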

The function you are looking for is called replace:

from pyspark.sql import functions as F

sdf.withColumn("data - A", F.expr("replace(data, coalesce(A, ''), '')")).show(
    truncate=False
)
+---------------------------------------------------+-----------------+----------------------------------+
|data                                               |A                |data - A                          |
+---------------------------------------------------+-----------------+----------------------------------+
|https://example1.org/path/to/file?param=42#fragment|param=42#fragment|https://example1.org/path/to/file?|
|https://example2.org/path/to/file                  |null             |https://example2.org/path/to/file |
+---------------------------------------------------+-----------------+----------------------------------+
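
A side note: replace removes every occurrence of the search string, not just the first one. Also, from PySpark 3.5 onwards the same SQL function is exposed directly in pyspark.sql.functions, so the expr string can be avoided (a minimal sketch, assuming PySpark >= 3.5):

from pyspark.sql import functions as F

# Column-API equivalent of the expr version above.
sdf.withColumn(
    "data - A",
    F.replace("data", F.coalesce("A", F.lit("")), F.lit("")),
).show(truncate=False)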