Cannot load Parquet file with pyspark (Parquet type not supported: INT32 (UINT_8);)
I am trying to load a Parquet file stored in Hadoop.
This is my table:
name type
----------------
ID BIGINT
point SMALLINT
check TINYINT
This is what I run:
df = sqlContext.read.parquet('path')
I get this error:
Caused by: org.apache.spark.sql.AnalysisException: Parquet type not supported: INT32 (UINT_8);
at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.typeNotSupported(ParquetSchemaConverter.scala:101)
at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:137)
at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertField(ParquetSchemaConverter.scala:89)
at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter$$anonfun.apply(ParquetSchemaConverter.scala:68)
at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter$$anonfun.apply(ParquetSchemaConverter.scala:65)
at scala.collection.TraversableLike$$anonfun$map.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.org$apache$spark$sql$execution$datasources$parquet$ParquetToSparkSchemaConverter$$convert(ParquetSchemaConverter.scala:65)
at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:62)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readSchemaFromFooter.apply(ParquetFileFormat.scala:664)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readSchemaFromFooter.apply(ParquetFileFormat.scala:664)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readSchemaFromFooter(ParquetFileFormat.scala:664)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun.apply(ParquetFileFormat.scala:621)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun.apply(ParquetFileFormat.scala:603)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$$anonfun$apply.apply(RDD.scala:801)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$$anonfun$apply.apply(RDD.scala:801)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun.apply(Executor.scala:407)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
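While trying to solve this, I found that Spark's Parquet reader does not support some types.
Is there a way to load my table? Is creating a new table the only option? I have already spent a long time on this problem...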
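Spark's Parquet reader does not support some types, such as unsigned integers (uint). My table has a uint column, which is why the read fails.
I solved the problem with this answer.
First, create a new schema: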
from pyspark.sql.types import *
newSchema = StructType([
    StructField("ID", LongType(), True),
    StructField("point", IntegerType(), True),
    StructField("check", IntegerType(), True),
])
Then open the Parquet file with this schema:
df = hc.read.option("mergeSchema", "true").schema(newSchema).parquet(path)
This worked for me.
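As a rough illustration of why a wider signed type is a safe choice for an unsigned column, here is a small pure-Python sketch. The helper `smallest_signed_type` and the `SIGNED_RANGES` table are hypothetical names made up for this example, not part of pyspark. The strings are only the names of the corresponding classes in `pyspark.sql.types`. The sketch just compares value ranges, so it makes no claim about what any particular Spark version does internally.

```python
# Inclusive value ranges of Spark's signed integral types, narrowest first.
# The strings match class names in pyspark.sql.types; this table itself is
# an illustration and is not read from Spark.
SIGNED_RANGES = [
    ("ByteType", -2**7, 2**7 - 1),
    ("ShortType", -2**15, 2**15 - 1),
    ("IntegerType", -2**31, 2**31 - 1),
    ("LongType", -2**63, 2**63 - 1),
]


def smallest_signed_type(unsigned_bits):
    """Return the name of the narrowest signed type that can hold every
    value of an unsigned integer with the given bit width, or None if no
    signed integral type in the table is wide enough."""
    max_value = 2**unsigned_bits - 1
    for name, _low, high in SIGNED_RANGES:
        if max_value <= high:
            return name
    return None


for bits in (8, 16, 32, 64):
    print(f"UINT_{bits} -> {smallest_signed_type(bits)}")
```

By this reasoning, the UINT_8 column from the error message needs at least a ShortType, so the IntegerType used in the schema above is more than wide enough. A UINT_64 column would not fit in any of these integral types and would need a different approach, such as a decimal type.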