Spark reading a CSV file - the column value starts with a number and ends with D/F
I'm reading a CSV file with Spark. One of the fields in the CSV has the value 91520122094491671D.

After reading, the value becomes 9.152012209449166...

I found that this happens to any string that starts with a number and ends with D/F, but I need to read the data as a string. What should I do?

This is the CSV file data:
tax_file_code| cus_name| tax_identification_number
T19915201| 息烽家吉装饰材料店| 91520122094491671D
The Scala code is as follows:
sparkSession.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", true.toString)
  .load(getHadoopUri(uri))
  .createOrReplaceTempView("t_datacent_cus_temp_guizhou_ds_tmp")
sparkSession.sql(
s"""
| select cast(tax_file_code as String) as tax_file_code,
| cus_name,
| cast(tax_identification_number as String) as tax_identification_number
| from t_datacent_cus_temp_guizhou_ds_tmp
""".stripMargin).createOrReplaceTempView("t_datacent_cus_temp_guizhou_ds")
sparkSession.sql("select * from t_datacent_cus_temp_guizhou_ds").show
The result is as follows:
+-----------------+-----------------+-------------------------+
|tax_file_code | cus_name |tax_identification_number|
+-----------------+-----------------+-------------------------+
| T19915201 |息烽家吉装饰材料店 | 9.152012209449166...|
+-----------------+-----------------+-------------------------+
Can you try:
sparkSession.sql("select * from t_datacent_cus_temp_guizhou_ds").show(20, false)
i.e. with truncate set to false? If truncate is true, strings longer than 20 characters are truncated and all cells are right-aligned.
Edit:
val x = sparkSession.read
.option("header", "true")
.option("inferSchema", "true")
.csv("....src/main/resources/data.csv")
x.printSchema()
x.createOrReplaceTempView("t_datacent_cus_temp_guizhou_ds_tmp")
sparkSession.sql(
s"""
| select cast(tax_file_code as String) as tax_file_code,
| cus_name,
| cast(tax_identification_number as String) as tax_identification_number
| from t_datacent_cus_temp_guizhou_ds_tmp
""".stripMargin).createOrReplaceTempView("t_datacent_cus_temp_guizhou_ds")
sparkSession.sql("select * from t_datacent_cus_temp_guizhou_ds").show(truncate = false)
This will output:
+-------------+----------+-------------------------+
|tax_file_code|cus_name |tax_identification_number|
+-------------+----------+-------------------------+
|T19915201 | 息烽家吉装饰材料店|9.1520122094491664E16 |
+-------------+----------+-------------------------+
It sounds like the trailing D / F is making the schema inferrer treat the column as a double or float, and the column is being truncated, which is why you see the value in exponential notation.
If you want all columns to be strings, remove
option("inferSchema", true.toString)
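The root cause can be reproduced outside Spark. A minimal Scala sketch (an illustration, not the exact inference code Spark runs): the JVM's Double.parseDouble accepts a trailing d/D or f/F suffix, so an inferrer that probes each field with a Double parse will classify "91520122094491671D" as numeric, and a Double cannot hold all 17 digits exactly.

```scala
// Java's floating-point grammar allows a trailing D/F suffix, so this
// parse succeeds instead of throwing NumberFormatException:
val asDouble = java.lang.Double.parseDouble("91520122094491671D")

// A Double carries only ~15-17 significant decimal digits, so the exact
// integer 91520122094491671 can no longer be represented:
println(asDouble) // printed in scientific notation, with precision lost
```

This matches the 9.15...E16 value seen in the output above.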
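If the column must be kept verbatim, here is a sketch of two alternatives (assuming the same sparkSession, getHadoopUri and uri from the question): either drop the inferSchema option so every CSV column defaults to string, or pin an explicit all-string schema so inference never runs.

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Alternative 1: no inferSchema option -- CSV columns default to StringType.
val allStrings = sparkSession.read
  .option("header", "true")
  .csv(getHadoopUri(uri))

// Alternative 2: declare the schema up front; inference never runs, so
// 91520122094491671D survives as the literal string.
val schema = StructType(Seq(
  StructField("tax_file_code", StringType),
  StructField("cus_name", StringType),
  StructField("tax_identification_number", StringType)))

val typed = sparkSession.read
  .option("header", "true")
  .schema(schema)
  .csv(getHadoopUri(uri))
```

With either alternative the cast(... as String) expressions in the SQL become unnecessary.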