Read fixed length file with implicit decimal point?
Suppose I have a data file like this:
foo12345
bar45612
I want to parse it into:
+---+------+
| id|   amt|
+---+------+
|foo|123.45|
|bar|456.12|
+---+------+
In other words, I want select df.value.substr(4,5).alias('amt'), but with the value interpreted as a five-digit number whose last two digits fall after the decimal point. Surely there is a better way than "divide by 100"?
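One approach is to slice the field into its integer and fractional parts, rejoin them around an explicit decimal point, and cast the result, which sidesteps the division entirely: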
from pyspark.sql.functions import substring, concat, lit
from pyspark.sql.types import DoubleType

# sample data
df = sc.parallelize([
    ['foo12345'],
    ['bar45612']
]).toDF(["value"])

# take the first 3 characters as the id, then rebuild the amount by
# inserting an explicit '.' between the integer and fractional digits
df = df.withColumn('id', substring('value', 1, 3)) \
    .withColumn('amt',
                concat(substring('value', 4, 3), lit('.'), substring('value', 7, 2))
                .cast(DoubleType()))
df.show()
The output is:
+--------+---+------+
| value| id| amt|
+--------+---+------+
|foo12345|foo|123.45|
|bar45612|bar|456.12|
+--------+---+------+
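A note on the cast: DoubleType cannot represent every two-decimal value exactly, so if amt is money you may prefer an exact decimal type. A minimal variation on the same approach, assuming the df built above, casts the concatenation to DecimalType(5, 2) instead:

from pyspark.sql.functions import substring, concat, lit
from pyspark.sql.types import DecimalType

# same string rebuild, but keep an exact decimal(5, 2): five significant
# digits, two of them after the decimal point
df = df.withColumn(
    'amt',
    concat(substring('value', 4, 3), lit('.'), substring('value', 7, 2))
        .cast(DecimalType(5, 2))
)

Either way, building the decimal point into the string avoids the floating-point division that "divide by 100" implies.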