Spark reading a CSV file - the column value starts with a number and ends with D/F

I'm reading a CSV file with Spark. One of the field values in the CSV is 91520122094491671D,
but after reading, the value becomes 9.152012209449166....
I've found that any string that starts with a digit and ends with D/F is read this way,
but I need the data to be read as a string.
What should I do?

Here is the data in the CSV file:

tax_file_code|  cus_name|   tax_identification_number
T19915201|  息烽家吉装饰材料店|  91520122094491671D

The Scala code is as follows:

sparkSession.read.format("com.databricks.spark.csv")
  .option("header", "true") 
  .option("inferSchema", true.toString) 
  .load(getHadoopUri(uri)) 
  .createOrReplaceTempView("t_datacent_cus_temp_guizhou_ds_tmp")

sparkSession.sql(
  s"""
     |  select  cast(tax_file_code as String) as tax_file_code,
     |          cus_name,
     |          cast(tax_identification_number as String) as tax_identification_number
     |  from    t_datacent_cus_temp_guizhou_ds_tmp
  """.stripMargin).createOrReplaceTempView("t_datacent_cus_temp_guizhou_ds")

sparkSession.sql("select * from t_datacent_cus_temp_guizhou_ds").show

The result is as follows:

+-----------------+-----------------+-------------------------+
|tax_file_code    | cus_name        |tax_identification_number|
+-----------------+-----------------+-------------------------+
|    T19915201    |息烽家吉装饰材料店 |     9.152012209449166...|
+-----------------+-----------------+-------------------------+

Can you try:

sparkSession.sql("select * from t_datacent_cus_temp_guizhou_ds").show(20, false)

This sets truncate to false. When it is true (the default), strings longer than 20 characters are truncated and all cells are right-aligned.
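
For reference, a minimal sketch of the show overloads involved (df stands in for any DataFrame already in scope):

df.show()                  // default: 20 rows, values truncated to 20 characters
df.show(20, false)         // 20 rows, no truncation
df.show(truncate = false)  // same effect via the single-argument overload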

Edit:

val x = sparkSession.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("....src/main/resources/data.csv")

x.printSchema()

x.createOrReplaceTempView("t_datacent_cus_temp_guizhou_ds_tmp")


sparkSession.sql(
  s"""
     |  select  cast(tax_file_code as String) as tax_file_code,
     |          cus_name,
     |          cast(tax_identification_number as String) as tax_identification_number
     |  from    t_datacent_cus_temp_guizhou_ds_tmp
  """.stripMargin).createOrReplaceTempView("t_datacent_cus_temp_guizhou_ds")

sparkSession.sql("select * from t_datacent_cus_temp_guizhou_ds").show(truncate = false)

This outputs:

+-------------+----------+-------------------------+
|tax_file_code|cus_name  |tax_identification_number|
+-------------+----------+-------------------------+
|T19915201    | 息烽家吉装饰材料店|9.1520122094491664E16    |
+-------------+----------+-------------------------+

It sounds like the trailing D/F is making the schema inferrer type the column as a Double or Float; the displayed value was simply being truncated before, which is why you now see the full value in exponent notation.
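
You can reproduce the misreading outside Spark. Spark's CSV type inference appears to fall back on Java's Double parsing, and Double.parseDouble follows the Java floating-point literal grammar, which allows a trailing D or F type suffix. A minimal sketch:

// java.lang.Double.parseDouble accepts a trailing D/F type suffix,
// so this value looks like a perfectly valid Double to the inferrer
val parsed = java.lang.Double.parseDouble("91520122094491671D")
println(parsed)  // prints 9.1520122094491664E16 -- a Double holds only
                 // ~16 significant digits, so the last digits are already lost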

If you want all columns to be strings, remove

option("inferSchema", true.toString)