Spark reading a CSV file - the column value starts with a number and ends with D/F
I'm reading a CSV file with Spark. One of the fields in the CSV has the value 91520122094491671D.

After reading, the value becomes 9.152012209449166...

I found that this happens to any string that starts with a number and ends with D/F, but I need to read the data as a string. What should I do?

This is the CSV file data:
tax_file_code| cus_name| tax_identification_number
T19915201| 息烽家吉装饰材料店| 91520122094491671D
The Scala code is as follows:
sparkSession.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", true.toString)
  .load(getHadoopUri(uri))
  .createOrReplaceTempView("t_datacent_cus_temp_guizhou_ds_tmp")
sparkSession.sql(
s"""
| select cast(tax_file_code as String) as tax_file_code,
| cus_name,
| cast(tax_identification_number as String) as tax_identification_number
| from t_datacent_cus_temp_guizhou_ds_tmp
""".stripMargin).createOrReplaceTempView("t_datacent_cus_temp_guizhou_ds")
sparkSession.sql("select * from t_datacent_cus_temp_guizhou_ds").show
The result is as follows:
+-----------------+-----------------+-------------------------+
|tax_file_code | cus_name |tax_identification_number|
+-----------------+-----------------+-------------------------+
| T19915201 |息烽家吉装饰材料店 | 9.152012209449166...|
+-----------------+-----------------+-------------------------+
Can you try:
sparkSession.sql("select * from t_datacent_cus_temp_guizhou_ds").show(20, false)
i.e. with truncate set to false? If truncate is true, strings longer than 20 characters are truncated and all cells are right-aligned.
Edit:
val x = sparkSession.read
.option("header", "true")
.option("inferSchema", "true")
.csv("....src/main/resources/data.csv")
x.printSchema()
x.createOrReplaceTempView("t_datacent_cus_temp_guizhou_ds_tmp")
sparkSession.sql(
s"""
| select cast(tax_file_code as String) as tax_file_code,
| cus_name,
| cast(tax_identification_number as String) as tax_identification_number
| from t_datacent_cus_temp_guizhou_ds_tmp
""".stripMargin).createOrReplaceTempView("t_datacent_cus_temp_guizhou_ds")
sparkSession.sql("select * from t_datacent_cus_temp_guizhou_ds").show(truncate = false)
This will output:
+-------------+----------+-------------------------+
|tax_file_code|cus_name |tax_identification_number|
+-------------+----------+-------------------------+
|T19915201 | 息烽家吉装饰材料店|9.1520122094491664E16 |
+-------------+----------+-------------------------+
It sounds like the trailing D / F is making the schema inferrer treat the column as a double or float, and the column is being truncated, which is why you see the value in exponential notation.
If you want all columns to be strings, remove
option("inferSchema", true.toString)
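The root cause can be reproduced outside Spark. A minimal Scala sketch (an illustration, not the exact inference code Spark runs): the JVM's Double.parseDouble accepts a trailing d/D or f/F suffix, so an inferrer that probes each field with a Double parse will classify "91520122094491671D" as numeric, and a Double cannot hold all 17 digits exactly.

```scala
// Java's floating-point grammar allows a trailing D/F suffix, so this
// parse succeeds instead of throwing NumberFormatException:
val asDouble = java.lang.Double.parseDouble("91520122094491671D")

// A Double carries only ~15-17 significant decimal digits, so the exact
// integer 91520122094491671 can no longer be represented:
println(asDouble) // printed in scientific notation, with precision lost
```

This matches the 9.15...E16 value seen in the output above.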
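If the column must be kept verbatim, here is a sketch of two alternatives (assuming the same sparkSession, getHadoopUri and uri from the question): either drop the inferSchema option so every CSV column defaults to string, or pin an explicit all-string schema so inference never runs.

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Alternative 1: no inferSchema option -- CSV columns default to StringType.
val allStrings = sparkSession.read
  .option("header", "true")
  .csv(getHadoopUri(uri))

// Alternative 2: declare the schema up front; inference never runs, so
// 91520122094491671D survives as the literal string.
val schema = StructType(Seq(
  StructField("tax_file_code", StringType),
  StructField("cus_name", StringType),
  StructField("tax_identification_number", StringType)))

val typed = sparkSession.read
  .option("header", "true")
  .schema(schema)
  .csv(getHadoopUri(uri))
```

With either alternative the cast(... as String) expressions in the SQL become unnecessary.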