Replace a column value in a Spark DataFrame

Could you help me replace column values in this Spark DataFrame:

data = [["1", "xxx", "company 0"],
        ["2", "xxx", "company 1"],
        ["3", "company 44", "company 2"],
        ["4", "xxx", "company 1"],
        ["5", "bobby", "company 1"]]


dataframe = spark.createDataFrame(data)

I am trying to replace "company" with "cmp". "company" can appear in different columns.

Since "company" may appear in any column, you have to iterate over every column and apply regexp_replace to each one:

from pyspark.sql import functions as F

cols = dataframe.columns

for c in cols:
    # Overwrite each column with the same column after the substring replacement
    dataframe = dataframe.withColumn(c, F.regexp_replace(c, 'company', 'cmp'))

dataframe.show()

+---+------+-----+
| _1|    _2|   _3|
+---+------+-----+
|  1|   xxx|cmp 0|
|  2|   xxx|cmp 1|
|  3|cmp 44|cmp 2|
|  4|   xxx|cmp 1|
|  5| bobby|cmp 1|
+---+------+-----+
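
If you prefer not to rebuild the DataFrame column by column, the same replacement can be written as a single select over a list comprehension. This is a minimal sketch equivalent to the loop above, applying one projection that rewrites every column while keeping its original name:

from pyspark.sql import functions as F

# One projection that replaces 'company' with 'cmp' in every column
dataframe = dataframe.select(
    [F.regexp_replace(c, 'company', 'cmp').alias(c) for c in dataframe.columns]
)
dataframe.show()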

Functional programming approach

from functools import reduce
from pyspark.sql import functions as F
cols = dataframe.columns
reduce(
    lambda df, c: df.withColumn(c, F.regexp_replace(c, 'company', 'cmp')),
    cols,
    dataframe,
).show()
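
One caveat: regexp_replace treats its second argument as a regular expression. 'company' contains no metacharacters, so it behaves like a literal match here, but if the search string might contain characters such as '.' or '(', it is safer to escape it first. A small sketch, assuming the search term is meant literally:

import re
from pyspark.sql import functions as F

search = 'company'          # substring to replace, taken as a literal rather than a regex
pattern = re.escape(search)  # escape any regex metacharacters

dataframe = dataframe.select(
    [F.regexp_replace(c, pattern, 'cmp').alias(c) for c in dataframe.columns]
)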