Split one column based on the value of another column in PySpark

Question

I have the following DataFrame:

+----+-------+
|item|   path|
+----+-------+
|   a|  a/b/c|
|   b|  e/b/f|
|   d|e/b/d/h|
|   c|  g/h/c|
+----+-------+

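I want to find the relative path for column "item" by locating its value in column 'path' and extracting the left-hand side (LHS) of the path, up to and including that value, as shown below: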

+----+-------+--------+
|item|   path|rel_path|
+----+-------+--------+
|   a|  a/b/c|       a|
|   b|  e/b/f|     e/b|
|   d|e/b/d/h|   e/b/d|
|   c|  g/h/c|   g/h/c|
+----+-------+--------+

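I tried the functions split(str, pattern) and regexp_extract(str, pattern, idx), but I'm not sure how to pass the value of column 'item' into their pattern argument. Any idea how to do this without writing a function?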

Answer 1

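You can use pyspark.sql.functions.expr to call regexp_replace with a column value in the pattern. Here, you concatenate a positive lookbehind for item with .+ to match everything that follows it, and replace that match with an empty string.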

from pyspark.sql.functions import expr

df.withColumn(
    "rel_path", 
    expr("regexp_replace(path, concat('(?<=',item,').+'), '')")
).show()
#+----+-------+--------+
#|item|   path|rel_path|
#+----+-------+--------+
#|   a|  a/b/c|       a|
#|   b|  e/b/f|     e/b|
#|   d|e/b/d/h|   e/b/d|
#|   c|  g/h/c|   g/h/c|
#+----+-------+--------+
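
To see what the pattern does for a single row, the same lookbehind can be tried with Python's built-in re module. This is only a sketch under stated assumptions: rel_path is a hypothetical helper name, Spark evaluates the pattern with Java regex rather than Python's engine, and re.escape is an addition here (the Spark expression above does not escape item, so an item containing regex metacharacters would change the pattern).

```python
import re

def rel_path(item: str, path: str) -> str:
    # Hypothetical helper mirroring
    #   regexp_replace(path, concat('(?<=', item, ').+'), '')
    # re.escape is an addition that the Spark expression does not have.
    pattern = "(?<=" + re.escape(item) + ").+"
    # Delete everything that follows the first occurrence of item.
    return re.sub(pattern, "", path)

rows = [("a", "a/b/c"), ("b", "e/b/f"), ("d", "e/b/d/h"), ("c", "g/h/c")]
results = [rel_path(item, path) for item, path in rows]
print(results)  # ['a', 'e/b', 'e/b/d', 'g/h/c']
```

When item is the last segment (row c), nothing follows it, so .+ has nothing to match and the path is returned unchanged. The match is on substrings, not on whole path segments, so an item that also appears inside an earlier segment would cut the path there.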

Answer 2

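You can use a combination of substring and instr to get the desired result:

substring - extracts a subset of a column/string.

instr - returns the position of a specific pattern within the searched string.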

df = spark.createDataFrame([('a','a/b/c'),
                            ('b','e/b/f'),
                            ('d','e/b/d/h'),
                            ('c','g/h/c')],'item : string , path : string')

# Only expr is needed: substring and instr are called inside the SQL expression string
from pyspark.sql.functions import expr

df.withColumn("rel_path",expr("substring(path, 1, (instr(path,item)))")).show()

##+----+-------+--------+
##|item|   path|rel_path|
##+----+-------+--------+
##|   a|  a/b/c|       a|
##|   b|  e/b/f|     e/b|
##|   d|e/b/d/h|   e/b/d|
##|   c|  g/h/c|   g/h/c|
##+----+-------+--------+
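
The expression above uses the 1-based position returned by instr as the substring length, which lines up with the end of item only when item is a single character. The plain-Python sketch below emulates the two SQL functions to show this and a possible generalization. All function names here are hypothetical stand-ins, not Spark APIs, and leaving the path unchanged when item is absent is an assumption rather than something the answer specifies.

```python
def instr(s: str, sub: str) -> int:
    # Emulates SQL instr: 1-based position of the first occurrence, 0 if absent.
    return s.find(sub) + 1

def substring(s: str, pos: int, length: int) -> str:
    # Emulates SQL substring(str, pos, len) for pos >= 1 only.
    return s[pos - 1 : pos - 1 + length]

def rel_path_single(item: str, path: str) -> str:
    # Same shape as the answer: substring(path, 1, instr(path, item)).
    return substring(path, 1, instr(path, item))

def rel_path_multi(item: str, path: str) -> str:
    # Assumed generalization: extend the length to cover the whole item.
    start = instr(path, item)
    if start == 0:
        return path  # assumption: keep the path as-is when item is not found
    return substring(path, 1, start + len(item) - 1)

print(rel_path_single("b", "e/b/f"))    # e/b
print(rel_path_single("bc", "e/bc/f"))  # e/b  (item is cut short)
print(rel_path_multi("bc", "e/bc/f"))   # e/bc
```

In Spark SQL the corresponding length would be instr(path, item) + length(item) - 1, with the not-found case (instr returning 0) handled separately.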

apache-spark

pyspark

pyspark-sql