Hive external table on Parquet not fetching data

Question

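I am trying to build a data pipeline in which the incoming data is stored as Parquet, and I create an external Hive table so that users can query the Hive table and retrieve the data. I am able to save the Parquet data and read it back directly, but when I query the Hive table it returns no rows. I set up the following test: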

--create external HIVE TABLE
create external table emp ( id double, hire_dt timestamp, user string ) stored as parquet location '/test/emp';

Now create a DataFrame from some data and save it to Parquet.

---create dataframe and insert data

val employeeDf = Seq(("1", "2018-01-01","John"),("2","2018-12-01", "Adam")).toDF("id","hire_dt","user")
val schema = List(("id", "double"), ("hire_dt", "date"), ("user", "string"))
val newCols= schema.map ( x => col(x._1).cast(x._2)) 
val newDf = employeeDf.select(newCols:_*)
newDf.write.mode("append").parquet("/test/emp")
newDf.show 

--read the contents directly from parquet 
val sqlcontext=new org.apache.spark.sql.SQLContext(sc)
sqlcontext.read.parquet("/test/emp").show 

+---+----------+----+
| id|   hire_dt|user|
+---+----------+----+
|1.0|2018-01-01|John|
|2.0|2018-12-01|Adam|
+---+----------+----+

--read from the external hive table 
spark.sql("select  id,hire_dt,user from  emp").show(false)

+---+-------+----+
|id |hire_dt|user|
+---+-------+----+
+---+-------+----+

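As shown above, I can see the data if I read directly from Parquet, but not if I read from Hive. What am I doing wrong here that keeps Hive from picking up the data? I thought msck repair might be the cause, but if I try to run msck repair table, I get an error saying the table is not partitioned.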

Answer 1

Based on your create table statement, you are using /test/emp as the location, but when writing the data you are writing to /tenants/gwm/idr/emp. So there will be no data at /test/emp.

--CREATE EXTERNAL HIVE TABLE
create external table emp ( id double, hire_dt timestamp, user string ) stored as parquet location '/test/emp';

Please recreate the external table as:

--CREATE EXTERNAL HIVE TABLE
create external table emp ( id double, hire_dt timestamp, user string ) stored as parquet location '/tenants/gwm/idr/emp';
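One way to confirm whether the table location and the write path agree is to ask the catalog where the table points. The following is only a sketch, assuming an existing `SparkSession` that can see the `emp` table; the object and helper names (`LocationCheck`, `tableLocation`) are hypothetical, and the exact rows printed by `DESCRIBE FORMATTED` can differ between Spark versions.

```scala
import org.apache.spark.sql.SparkSession

object LocationCheck {
  // Hypothetical helper: returns the storage location recorded for a table, if one is listed.
  // DESCRIBE FORMATTED returns rows of (col_name, data_type, comment); the row whose
  // first column starts with "Location" carries the path in its second column.
  def tableLocation(spark: SparkSession, table: String): Option[String] =
    spark.sql(s"DESCRIBE FORMATTED $table")
      .collect()
      .find(row => Option(row.getString(0)).exists(_.trim.startsWith("Location")))
      .flatMap(row => Option(row.getString(1)).map(_.trim))
}
```

Calling `LocationCheck.tableLocation(spark, "emp")` and comparing the result with the path passed to `write.parquet(...)` shows quickly whether the two point at the same directory.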

Answer 2

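In addition to the answer given by Ramdev, you also need to take care to use the correct data type for date/timestamp columns, because the 'date' type is not supported for Parquet when creating a Hive table (at least in some Hive versions).

To address that, you can change the type of the 'hire_dt' column from 'date' to 'timestamp'.

Otherwise, the data you persist through Spark and the data you try to read in Hive (or Hive SQL) will not match. Keeping 'timestamp' in both places should resolve the issue. I hope it helps.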
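To make the type mismatch concrete, the declared table schema can be compared with the schema that was actually written. This is a minimal sketch in plain Scala, not part of any library: `SchemaCheck` and `mismatches` are hypothetical names, and the type names are assumed to be in the form that Spark's `DataFrame.dtypes` reports (for example `DateType`). The sample values mirror the question, where the table declares `timestamp` but the data was cast to `date`.

```scala
object SchemaCheck {
  // A schema here is a sequence of (columnName, typeName) pairs.
  type Schema = Seq[(String, String)]

  // Returns (column, declaredType, writtenType) for each declared column
  // whose written type differs or is missing.
  def mismatches(declared: Schema, written: Schema): Seq[(String, String, String)] = {
    val writtenByName = written.toMap
    declared.flatMap { case (name, declaredType) =>
      val writtenType = writtenByName.getOrElse(name, "<missing>")
      if (writtenType == declaredType) None
      else Some((name, declaredType, writtenType))
    }
  }

  def main(args: Array[String]): Unit = {
    // Types taken from the question: the Hive table vs. the DataFrame that was written.
    val declared = Seq("id" -> "DoubleType", "hire_dt" -> "TimestampType", "user" -> "StringType")
    val written  = Seq("id" -> "DoubleType", "hire_dt" -> "DateType", "user" -> "StringType")
    println(mismatches(declared, written)) // List((hire_dt,TimestampType,DateType))
  }
}
```

With a live session, the two inputs could presumably be taken from `spark.table("emp").dtypes` and `spark.read.parquet("/test/emp").dtypes`.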

Answer 3

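Does your SparkSession builder() statement include enableHiveSupport()? Are you able to connect to the Hive metastore? Try running show tables/databases in your code to see whether the tables that exist in your Hive location are listed.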
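The checks suggested above could look roughly like the sketch below. It assumes the spark-hive module is on the classpath, that the job is launched with spark-submit (which supplies the master), and that a Hive metastore configuration is reachable; the object name and application name are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object HiveSessionCheck {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() makes Spark use the Hive metastore as its catalog.
    val spark = SparkSession.builder()
      .appName("hive-session-check") // hypothetical application name
      .enableHiveSupport()
      .getOrCreate()

    // If the metastore is reachable, the expected databases and the emp table should be listed.
    spark.sql("SHOW DATABASES").show(false)
    spark.sql("SHOW TABLES").show(false)

    // Prints "hive" when Hive support is active, "in-memory" otherwise.
    println(spark.conf.get("spark.sql.catalogImplementation"))

    spark.stop()
  }
}
```

If `emp` is missing from the `SHOW TABLES` output, the session is probably not talking to the same metastore in which the table was created.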

Answer 4

I got this working with the change below.

val dfTransformed = employeeDf.withColumn("id", employeeDf.col("id").cast(DoubleType))
            .withColumn("hire_dt", employeeDf.col("hire_dt").cast(TimestampType))

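So basically the issue was a data type mismatch, and the casting in the original code did not seem to work properly. I did an explicit cast instead; it then wrote fine and I was able to query it back as well. Logically both versions are doing the same thing, so I am not sure why the original code did not work. One visible difference is that the original schema list cast hire_dt to date, whereas this version casts it to timestamp, which matches the table definition.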

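import org.apache.spark.sql.types.{DoubleType, TimestampType}

val employeeDf = Seq(("1", "2018-01-01","John"),("2","2018-12-01", "Adam")).toDF("id","hire_dt","user")

// Cast explicitly so that the written types match the Hive table definition (double, timestamp)
val dfTransformed = employeeDf.withColumn("id", employeeDf.col("id").cast(DoubleType))
    .withColumn("hire_dt", employeeDf.col("hire_dt").cast(TimestampType))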

dfTransformed.write.mode("append").parquet("/test/emp")
dfTransformed.show 

--read the contents directly from parquet 
val sqlcontext=new org.apache.spark.sql.SQLContext(sc)
sqlcontext.read.parquet("/test/emp").show 

+---+----------+----+
| id|   hire_dt|user|
+---+----------+----+
|1.0|2018-01-01|John|
|2.0|2018-12-01|Adam|
+---+----------+----+

--read from the external hive table 
spark.sql("select  id,hire_dt,user from  emp").show(false)
+---+----------+----+
| id|   hire_dt|user|
+---+----------+----+
|1.0|2018-01-01|John|
|2.0|2018-12-01|Adam|
+---+----------+----+

hive

hiveql

apache-spark

parquet

apache-spark-sql