How to convert any datatype other than string to string in pyspark dataframe

Question

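I am trying to apply the pyspark sql hash function to every row in two dataframes to identify the differences. The hash algorithm is string-based, so I am trying to convert every datatype other than string to string. Most of my problems are with the date columns, because the date format has to be changed before the conversion to string so that the hash-based matching is consistent. Please help me with the approach.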

# Identify the fields which are not strings
from pyspark.sql.functions import col
from pyspark.sql.types import *
fields = df_db1.schema.fields
nonStringFields = map(lambda f: col(f.name), filter(lambda f: not isinstance(f.dataType, StringType), fields))

# Convert the date fields to a specific date format and convert to string.
DateFields = map(lambda f: col(f.name), filter(lambda f: isinstance(f.dataType, DateType), fields))

# Convert all fields other than string to string.

Answer 1

For numeric and date fields you can use cast:

# keep only the date fields from the schema
DateFields = filter(lambda f: isinstance(f.dataType, DateType), fields)

# cast each date column to string, keeping its original name
dateFieldsWithCast = map(lambda f: col(f.name).cast("string").alias(f.name), DateFields)

In a similar way you can create lists of columns for the Long type and so on, and then do a select as in this answer.
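
As a rough sketch of how the per-type lists could be collapsed into a single select, the helper below builds one Spark SQL expression per column from (name, type) pairs, which is the shape that DataFrame.dtypes returns. The names string_cast_exprs and date_fmt are made up for illustration, the default pattern 'yyyy-MM-dd' is only an assumption about the format you want, and column names are assumed not to contain backticks. It only builds strings, so it runs without a Spark session.

```python
# Sketch only: builds Spark SQL expression strings for use with selectExpr.
# `string_cast_exprs` and `date_fmt` are hypothetical names, not a Spark API.

def string_cast_exprs(dtypes, date_fmt="yyyy-MM-dd"):
    """Return one SQL expression per column so that every column becomes a string.

    `dtypes` is a list of (column_name, type_name) pairs, the shape that
    DataFrame.dtypes returns, e.g. [("id", "bigint"), ("dob", "date")].
    """
    exprs = []
    for name, type_name in dtypes:
        # Backtick-quote the name (assumes the name itself has no backticks).
        quoted = "`{}`".format(name)
        if type_name == "string":
            # Already a string: select it unchanged.
            exprs.append(quoted)
        elif type_name == "date":
            # Render dates with one explicit pattern so both dataframes agree.
            exprs.append("date_format({0}, '{1}') AS {0}".format(quoted, date_fmt))
        else:
            # Numeric, boolean, timestamp and other types: plain cast to string.
            exprs.append("CAST({0} AS STRING) AS {0}".format(quoted))
    return exprs
```

With a real dataframe this would presumably be used as df_db1.selectExpr(*string_cast_exprs(df_db1.dtypes)), and the same call on the second dataframe, so that both sides are rendered identically before hashing.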

python-3.x

apache-spark

pyspark

spark-dataframe

pyspark-sql