How to add a new column with random chars to pyspark dataframe
I'm trying to add a new column containing a random 8-character string to every row of a Spark DataFrame.
The function that generates the 8-character string:
import random
import string

def id(size=8, chars=string.ascii_lowercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))
My Spark DataFrame:
from pyspark.sql.functions import lit

columns = ["Seqno", "Name"]
data = [("1", "john jones"),
        ("2", "tracey smith"),
        ("3", "amy sanders")]
df = spark.createDataFrame(data=data, schema=columns)
df = df.withColumn("randomid", lit(id()))
df.show(truncate=False)
But with the code above, the same random id is duplicated on every row. Any pointers on how to make it unique per row?
+-----+------------+--------------------------------+
|Seqno|Name |randomid |
+-----+------------+--------------------------------+
|1 |john jones |uz6iugmraripznyzizt1ymvbs8gi2qv8|
|2 |tracey smith|uz6iugmraripznyzizt1ymvbs8gi2qv8|
|3 |amy sanders |uz6iugmraripznyzizt1ymvbs8gi2qv8|
+-----+------------+--------------------------------+
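For context on why the duplication happens: `lit()` takes a plain Python value, so `id()` runs exactly once on the driver before Spark ever evaluates the plan, and that single result is baked in as a constant. A minimal plain-Python sketch of the effect (no Spark needed):

```python
import random
import string

def id(size=8, chars=string.ascii_lowercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

# lit(id()) evaluates id() exactly once, on the driver, and bakes the
# single result into the query plan as a constant literal:
value = id()
column_values = [value for _ in range(3)]  # every "row" gets the same value
print(len(set(column_values)))  # 1 distinct value

# By contrast, evaluating id() once per row yields independent strings:
per_row = [id() for _ in range(3)]
print(all(len(s) == 8 for s in per_row))  # True
```

The fix, as the answers below show, is to use a function that Spark evaluates per row rather than a driver-side constant.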
You can use the uuid function to generate a string, then strip out the hyphens:
import pyspark.sql.functions as F

df = df.withColumn("randomid", F.expr('replace(uuid(), "-", "")'))
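Note that uuid() yields 32 hex characters once the hyphens are removed, not the 8 the question asked for; taking a substring is one way to shorten it (my assumption about the desired behavior, not part of the answer). A plain-Python sketch of the same string manipulation:

```python
import uuid

# Local equivalent of Spark SQL's replace(uuid(), "-", ""):
full_id = str(uuid.uuid4()).replace("-", "")
print(len(full_id))   # 32

# Trim to 8 characters, e.g. Spark SQL's
# substr(replace(uuid(), "-", ""), 1, 8):
short_id = full_id[:8]
print(len(short_id))  # 8
```

Since uuid() is non-deterministic, Spark evaluates it per row, so every row gets a distinct value.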
You can use the shuffle transformation:
import string
import pyspark.sql.functions as f

source_characters = string.ascii_letters + string.digits

df = spark.createDataFrame([
    ("1", "john jones"),
    ("2", "tracey smith"),
    ("3", "amy sanders")
], ['Seqno', 'Name'])

df = (df
      .withColumn('source_characters', f.split(f.lit(source_characters), ''))
      .withColumn('random_string', f.concat_ws('', f.slice(f.shuffle(f.col('source_characters')), 1, 8)))
      .drop('source_characters'))
df.show()
Output:
+-----+------------+-------------+
|Seqno| Name|random_string|
+-----+------------+-------------+
| 1| john jones| f8yWABgY|
| 2|tracey smith| Xp6idNb7|
| 3| amy sanders| zU8aSN4C|
+-----+------------+-------------+
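To unpack what this pipeline does per row, here is the same sequence of operations mimicked in plain Python (a sketch; in Spark each step runs on the executors for every row):

```python
import random
import string

source_characters = string.ascii_letters + string.digits

# f.split(f.lit(source_characters), '') -> array of single characters
chars = list(source_characters)

# f.shuffle(...) -> randomly reorder the array (re-evaluated per row)
random.shuffle(chars)

# f.slice(..., 1, 8) -> take the first 8 elements (Spark's slice is 1-based)
first_eight = chars[:8]

# f.concat_ws('', ...) -> join the elements with an empty separator
random_string = ''.join(first_eight)
print(len(random_string))  # 8
```

One behavioral difference from the id() helper in the question: shuffling samples characters without replacement, so no character repeats within a single 8-character string, which slightly shrinks the space of possible ids.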