Specifying Schema of CSV in Apache Spark
I'm hitting an error in a simple case:
I want to read a bunch of CSVs, all in the same format, but without headers.
So, I am trying to specify the headers myself.
from pyspark.sql import SQLContext
from pyspark.sql.types import *

sqlContext = SQLContext(sc)

schema = StructType([
    StructField("c0", StringType(), True),
    StructField("c1", StringType(), True),
    StructField("c2", StringType(), True),
    StructField("c3", TimestampType, True),
    StructField("c4", TimestampType, True),
    StructField("c5", StringType(), True),
    StructField("c6", StringType(), True),
    StructField("c7", StringType(), True),
    StructField("c8", StringType(), True),
    StructField("c9", StringType(), True),
    StructField("c10", StringType(), True),
    StructField("c11", StringType(), True),
    StructField("c12", StringType(), True),
    StructField("c13", StringType(), True),
    StructField("c14", StringType(), True),
    StructField("c15", StringType(), True),
    StructField("c16", StringType(), True),
    StructField("c17", StringType(), True)
])

df = sqlContext.read.load('good_loc.csv',
                          format='com.databricks.spark.csv',
                          header='false',
                          inferSchema='true')
I get the error:
Traceback (most recent call last):
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/types.py", line 403, in __init__
    assert isinstance(dataType, DataType), "dataType should be DataType"
AssertionError: dataType should be DataType
I think the error comes from the TimestampType. I'm using Spark 2.2.
Thanks for the help!
StructField("c3", TimestampType, True),
StructField("c4", TimestampType, True),
成为
StructField("c3", TimestampType(), True),
StructField("c4", TimestampType(), True),