将单引号添加到 dataFrame 列值
Add single quotes to the dataFrame column values
DataFrame
持有一列 QUALIFY
,其值如下所示。
QUALIFY
=================
ColA|ColB|ColC
ColA
ColZ|ColP
此列中的值被 "|"
拆分。我希望此列中的值类似于 'ColA','ColB','ColC' ...
使用下面的代码,我可以用 ,',
替换 |
。如何在值的开头和结尾添加单引号?
newDf = df_qualify.withColumn('QUALIFY2', regexp_replace('QUALIFY', "\|", "\','"))
拆分 |
上的列,然后将生成的数组连接回字符串:
import pyspark.sql.functions as F
import pyspark.sql.types as T
def str_list(x):
return str(x).replace("[", "").replace("]", "")
str_udf = F.udf(str_list, T.StringType())
df = df.withColumn("arr_split", F.split(F.col("QUALIFY"), "\|")) # escape character
df = df.withColumn("QUALIFY2", str_udf(F.col("arr_split")))
我的示例输出帧:
df.drop("arr_split").show() # Please ignore a and b columns
+---+---+--------------+--------------------+
| a| b| abc| QUALIFY2|
+---+---+--------------+--------------------+
| 1| 1|col1|col2|col3|'col1', 'col2', '...|
| 2| 2|col1|col2|col3|'col1', 'col2', '...|
| 3| 3|col1|col2|col3|'col1', 'col2', '...|
| 4| 4|col1|col2|col3|'col1', 'col2', '...|
| 5| 5|col1|col2|col3|'col1', 'col2', '...|
+---+---+--------------+--------------------+
您的解决方案就差不多了 - 您只需要在开始和结束时添加一个单引号。您可以使用 pyspark.sql.functions.concat
:
来实现
from pyspark.sql.functions import col, concat, lit, regexp_replace
df.withColumn(
"QUALIFY2",
concat(lit("'"), regexp_replace(col('QUALIFY'), r"\|", r"','"), lit("'"))
).show()
#+--------------+--------------------+
#| QUALIFY| QUALIFY2|
#+--------------+--------------------+
#|ColA|ColB|ColC|'ColA','ColB','ColC'|
#| ColA| 'ColA'|
#| ColZ|ColP| 'ColZ','ColP'|
#+--------------+--------------------+
或者,您可以避免使用正则表达式并使用 split
and concat_ws
:
实现相同的效果
from pyspark.sql.functions import split, concat_ws
df.withColumn(
"QUALIFY2",
concat(lit("'"), concat_ws("','", split("QUALIFY", "\|")), lit("'"))
).show()
#+--------------+--------------------+
#| QUALIFY| QUALIFY2|
#+--------------+--------------------+
#|ColA|ColB|ColC|'ColA','ColB','ColC'|
#| ColA| 'ColA'|
#| ColZ|ColP| 'ColZ','ColP'|
#+--------------+--------------------+
DataFrame
持有一列 QUALIFY
,其值如下所示。
QUALIFY
=================
ColA|ColB|ColC
ColA
ColZ|ColP
此列中的值被 "|"
拆分。我希望此列中的值类似于 'ColA','ColB','ColC' ...
使用下面的代码,我可以用 ,',
替换 |
。如何在值的开头和结尾添加单引号?
newDf = df_qualify.withColumn('QUALIFY2', regexp_replace('QUALIFY', "\|", "\','"))
拆分 |
上的列,然后将生成的数组连接回字符串:
import pyspark.sql.functions as F
import pyspark.sql.types as T
def str_list(x):
return str(x).replace("[", "").replace("]", "")
str_udf = F.udf(str_list, T.StringType())
df = df.withColumn("arr_split", F.split(F.col("QUALIFY"), "\|")) # escape character
df = df.withColumn("QUALIFY2", str_udf(F.col("arr_split")))
我的示例输出帧:
df.drop("arr_split").show() # Please ignore a and b columns
+---+---+--------------+--------------------+
| a| b| abc| QUALIFY2|
+---+---+--------------+--------------------+
| 1| 1|col1|col2|col3|'col1', 'col2', '...|
| 2| 2|col1|col2|col3|'col1', 'col2', '...|
| 3| 3|col1|col2|col3|'col1', 'col2', '...|
| 4| 4|col1|col2|col3|'col1', 'col2', '...|
| 5| 5|col1|col2|col3|'col1', 'col2', '...|
+---+---+--------------+--------------------+
您的解决方案就差不多了 - 您只需要在开始和结束时添加一个单引号。您可以使用 pyspark.sql.functions.concat
:
from pyspark.sql.functions import col, concat, lit, regexp_replace
df.withColumn(
"QUALIFY2",
concat(lit("'"), regexp_replace(col('QUALIFY'), r"\|", r"','"), lit("'"))
).show()
#+--------------+--------------------+
#| QUALIFY| QUALIFY2|
#+--------------+--------------------+
#|ColA|ColB|ColC|'ColA','ColB','ColC'|
#| ColA| 'ColA'|
#| ColZ|ColP| 'ColZ','ColP'|
#+--------------+--------------------+
或者,您可以避免使用正则表达式并使用 split
and concat_ws
:
from pyspark.sql.functions import split, concat_ws
df.withColumn(
"QUALIFY2",
concat(lit("'"), concat_ws("','", split("QUALIFY", "\|")), lit("'"))
).show()
#+--------------+--------------------+
#| QUALIFY| QUALIFY2|
#+--------------+--------------------+
#|ColA|ColB|ColC|'ColA','ColB','ColC'|
#| ColA| 'ColA'|
#| ColZ|ColP| 'ColZ','ColP'|
#+--------------+--------------------+