Replace a column value in the spark DataFrame
Could you help me replace a column value in a Spark DataFrame:
data = [["1", "xxx", "company 0"],
        ["2", "xxx", "company 1"],
        ["3", "company 44", "company 2"],
        ["4", "xxx", "company 1"],
        ["5", "bobby", "company 1"]]
dataframe = spark.createDataFrame(data)
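I am trying to replace "company" with "cmp". "company" can appear in different columns.
Because "company" may appear in any column, you have to loop over every column and apply regexp_replace to each one: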
from pyspark.sql import functions as F
cols = dataframe.columns
# Rewrite each column in turn, replacing 'company' with 'cmp'
for c in cols:
    dataframe = dataframe.withColumn(c, F.regexp_replace(c, 'company', 'cmp'))
dataframe.show()
+---+------+-----+
| _1|    _2|   _3|
+---+------+-----+
|  1|   xxx|cmp 0|
|  2|   xxx|cmp 1|
|  3|cmp 44|cmp 2|
|  4|   xxx|cmp 1|
|  5| bobby|cmp 1|
+---+------+-----+
A functional programming approach:
from functools import reduce
from pyspark.sql import functions as F
cols = dataframe.columns
reduce(lambda dataframe, c: dataframe.withColumn(c, F.regexp_replace(c, 'company', 'cmp')), cols, dataframe).show()