HIVE: How does 'LIMIT' on 'SELECT * from' work under-the-hood?

Question

Just wondering how LIMIT works for the following simple query:

select * from T limit 100

Suppose table T has 13 million records.

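Will the above query:
1. First load all 13 million rows into memory and then display only 100 records in the result set?
2. Load only 100 rows and return a result set of 100 records?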

I have been searching for this for quite a while now, and most pages only talk about how to use "LIMIT", not how Hive handles it behind the scenes.

Any helpful response is appreciated.

Answer 1

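Under the hood, a "SELECT *" in Hive issues a FETCH task rather than generating a MapReduce job. Think of it as a hadoop fs -get. One thing to note here is that the FETCH task is only guaranteed to apply to SELECT *; if you select specific columns, the fetch conversion may not apply, depending on the hive.fetch.task.conversion setting.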

Source: https://vcfvct.wordpress.com/2016/02/18/make-hive-query-faster-with-fetch-task/
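One way to check this on your own cluster is to look at the query plan rather than run the query. A minimal sketch, assuming the table T from the question exists; the exact plan text varies by Hive version:

```sql
-- Print the execution plan without running the query.
EXPLAIN SELECT * FROM T LIMIT 100;
```

When the fetch conversion applies, the plan should contain only a Fetch Operator stage with a limit of 100, and no MapReduce or Tez stage.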

Answer 2

If no optimization is applied, Hive ends up scanning the entire table. However, Hive optimizes this case with hive.fetch.task.conversion, released as part of HIVE-2925, which lets simple queries with simple conditions run without launching MR/Tez at all.

Supported values are none, minimal and more.

none: Disable hive.fetch.task.conversion (value added in Hive 0.14.0 with HIVE-8389)

minimal: SELECT *, FILTER on partition columns (WHERE and HAVING clauses), LIMIT only

more: SELECT, FILTER, LIMIT only (including TABLESAMPLE, virtual columns)
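The setting can be changed per session, which makes it easy to compare the modes. A sketch, assuming the table T from the question and a hypothetical column named id (not part of the original question):

```sql
-- Disable fetch conversion: even a simple query is expected to launch an MR/Tez job.
SET hive.fetch.task.conversion=none;
SELECT * FROM T LIMIT 100;

-- Allow fetch conversion for SELECT * with LIMIT and partition-column filters.
SET hive.fetch.task.conversion=minimal;
SELECT * FROM T LIMIT 100;

-- Also allow column projections and general filters (id is a hypothetical column).
SET hive.fetch.task.conversion=more;
SELECT id FROM T WHERE id > 10 LIMIT 100;
```

Running each statement pair and watching whether a job is submitted shows which queries are served by the fetch task in each mode.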

Your question is most likely about what happens when minimal or more is set. In that case Hive simply scans through the added files and reads rows until it reaches leastRows(). For more, refer to the Hive source code and the configuration documentation.

