Jay Taylor's notes

[HIVE-4160] Vectorized Query Execution in Hive - ASF JIRA

Original source (issues.apache.org)
Tags: vectorized-execution hive issues.apache.org
Clipped on: 2016-05-23

Vectorized Query Execution in Hive


  • Type: New Feature
  • Status: Open
  • Priority: Major
  • Resolution: Unresolved
  • Affects Version/s: None
  • Fix Version/s: None
  • Component/s: None
  • Labels:


The Hive query execution engine currently processes one row at a time: a single row of data goes through all the operators before the next row can be processed. This mode of processing is very inefficient in terms of CPU usage, and research has demonstrated that it yields very low instructions per cycle [MonetDB X100]. Hive also relies heavily on lazy deserialization, and data columns go through a layer of object inspectors that identify the column type, deserialize the data, and determine the appropriate expression routines in the inner loop. These layers of virtual method calls further slow down processing.

This work will add support for vectorized query execution to Hive, where, instead of individual rows, batches of about a thousand rows at a time are processed. Each column in the batch is represented as a vector of a primitive data type. The inner loop of execution scans these vectors very fast, avoiding method calls, deserialization, unnecessary if-then-else, etc. This substantially reduces CPU time used, and gives excellent instructions per cycle (i.e. improved processor pipeline utilization). See the attached design specification for more details.
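As a rough illustration of the idea described above, the following sketch processes one column of a batch in a single tight loop over a primitive array. The class, field, and method names are hypothetical and are not taken from the Hive code base or the attached design specification.

```java
// Illustrative sketch only: names are hypothetical, not Hive's actual classes.
final class VectorizedAddSketch {
    // "About a thousand rows" per batch; the exact size is an assumption here.
    static final int BATCH_SIZE = 1024;

    /** One column of a batch, stored as an array of a primitive type. */
    static final class LongColumn {
        final long[] vector = new long[BATCH_SIZE];
    }

    /** Adds a scalar to the first `size` values of the input column. */
    static void addScalar(LongColumn in, long scalar, LongColumn out, int size) {
        long[] src = in.vector;
        long[] dst = out.vector;
        // The inner loop touches only primitive arrays: no per-row method
        // calls, no deserialization, no per-row type dispatch.
        for (int i = 0; i < size; i++) {
            dst[i] = src[i] + scalar;
        }
    }

    public static void main(String[] args) {
        LongColumn in = new LongColumn();
        LongColumn out = new LongColumn();
        int size = 4;
        for (int i = 0; i < size; i++) {
            in.vector[i] = i * 10L;
        }
        addScalar(in, 5L, out, size);
        // Prints [5, 15, 25, 35]
        System.out.println(java.util.Arrays.toString(
                java.util.Arrays.copyOf(out.vector, size)));
    }
}
```

A row-at-a-time engine would instead evaluate the same expression through one or more virtual calls per row, which is the overhead the description aims to remove.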


  1. Hive-Vectorized-Query-Execution-Design.docx
    18/Mar/13 23:40
    33 kB
  2. Hive-Vectorized-Query-Execution-Design-rev10.docx
    25/Jul/13 18:56
    41 kB
  3. Hive-Vectorized-Query-Execution-Design-rev10.docx
    25/Jul/13 18:53
    41 kB
  4. Hive-Vectorized-Query-Execution-Design-rev10.pdf
    25/Jul/13 18:55
    665 kB
  5. Hive-Vectorized-Query-Execution-Design-rev11.docx
    17/Sep/13 23:14
    42 kB
  6. Hive-Vectorized-Query-Execution-Design-rev11.pdf
    17/Sep/13 23:14
    671 kB
  7. Hive-Vectorized-Query-Execution-Design-rev2.docx
    06/Apr/13 01:00
    31 kB
  8. Hive-Vectorized-Query-Execution-Design-rev3.docx
    10/Apr/13 18:29
    32 kB
  9. Hive-Vectorized-Query-Execution-Design-rev3.docx
    10/Apr/13 18:22
    32 kB
  10. Hive-Vectorized-Query-Execution-Design-rev3.pdf
    10/Apr/13 18:29
    596 kB
  11. Hive-Vectorized-Query-Execution-Design-rev4.docx
    26/Apr/13 18:56
    32 kB
  12. Hive-Vectorized-Query-Execution-Design-rev4.pdf
    26/Apr/13 18:57
    596 kB
  13. Hive-Vectorized-Query-Execution-Design-rev5.docx
    09/May/13 22:29
    34 kB
  14. Hive-Vectorized-Query-Execution-Design-rev5.pdf
    09/May/13 22:29
    609 kB
  15. Hive-Vectorized-Query-Execution-Design-rev6.docx
    09/May/13 22:47
    34 kB
  16. Hive-Vectorized-Query-Execution-Design-rev6.pdf
    09/May/13 22:47
    609 kB
  17. Hive-Vectorized-Query-Execution-Design-rev7.docx
    13/May/13 18:59
    35 kB
  18. Hive-Vectorized-Query-Execution-Design-rev8.docx
    28/May/13 22:49
    36 kB
  19. Hive-Vectorized-Query-Execution-Design-rev8.pdf
    28/May/13 22:50
    651 kB
  20. Hive-Vectorized-Query-Execution-Design-rev9.docx
    25/Jun/13 21:22
    39 kB
  21. Hive-Vectorized-Query-Execution-Design-rev9.pdf
    25/Jun/13 21:22
    657 kB

    Issue Links


HIVE-5584 Write initial user documentation for vectorized query on Hive Wiki

  • Resolved
relates to

HIVE-10179 Optimization for SIMD instructions in Hive

  • Open


Implement vectorized logical expressions. Resolved Jitendra Nath Pandey  
Implement vectorized column-scalar expressions Resolved Jitendra Nath Pandey  
Implement class for vectorized row batch Resolved Eric Hanson  
Implement classes for column vectors. Resolved Eric Hanson  
Change ORC tree readers to return batches of rows instead of a row Resolved Sarvesh Sakalanaga  
Implement Vectorized Column-Column expressions Resolved Jitendra Nath Pandey  
Implement Vectorized Scalar-Column expressions Resolved Eric Hanson  
Implement vectorized aggregation expressions Resolved Remus Rusanu  
Implement vectorized string column-scalar filters Resolved Eric Hanson  
Implement vectorized string functions UPPER(), LOWER(), LENGTH() Resolved Eric Hanson  
Implement vectorized LIKE filter Resolved Eric Hanson  
Vectorized filter and select operators Resolved Jitendra Nath Pandey  
Generate vectorized execution plan Resolved Jitendra Nath Pandey  
Vectorized expression for unary minus. Resolved Jitendra Nath Pandey  
Implement vectorized string concatenation Resolved Eric Hanson  
Extend Vector Aggregates to support GROUP BY Resolved Remus Rusanu  
Add support for string column type vector aggregates: COUNT, MIN and MAX Resolved Remus Rusanu  
Add support for COUNT(*) in vector aggregates Resolved Remus Rusanu  
Input format to read vector data from ORC Resolved Jitendra Nath Pandey  
Support partitioned tables in vectorized query execution. Resolved Jitendra Nath Pandey  
Queries not supported by vectorized code path should fall back to non vector path. Resolved Jitendra Nath Pandey  
set isRepeating to false by default in ColumnArithmeticColumn.txt Resolved Eric Hanson  
Finish support for modulo (%) operator for vectorized arithmetic Resolved Eric Hanson  
Add unit tests for vectorized IS NULL and IS NOT NULL filters Resolved Jitendra Nath Pandey  
Extend plan vectorization to cover GroupByOperator Resolved Remus Rusanu  
OR, NOT Filter logic can lose an array, and always takes time O(VectorizedRowBatch.DEFAULT_SIZE) Resolved Jitendra Nath Pandey  
Improvement in logical expressions and checkstyle fixes. Resolved Jitendra Nath Pandey  
remove redundant copy of arithmetic filter unit test testColOpScalarNumericFilterNullAndRepeatingLogic Resolved Eric Hanson  
In ORC, add boolean noNulls flag to column stripe metadata Closed Prasanth Jayachandran  
Child expressions are not being evaluated hierarchically in a few templates. Resolved Jitendra Nath Pandey  
Implement partition support for vectorized query execution Resolved Sarvesh Sakalanaga  
Vectorized row batch should be initialized with additional columns to hold intermediate output. Resolved Jitendra Nath Pandey  
Template file VectorUDAFAvg.txt missing from public branch; CodeGen.java fails Resolved Remus Rusanu  
Input format to read vector data from RC file Resolved Sarvesh Sakalanaga  
Implement vectorized filter for string column compared to string column Resolved Eric Hanson  
Implement vectorized string substr Resolved Timothy Chen  
Integer division should be cast to double. Resolved Jitendra Nath Pandey  
Vectorized reader support for Byte Boolean and Timestamp. Resolved Sarvesh Sakalanaga  
The vectorized plan is not picking right expression class for string concatenation. Resolved Eric Hanson  
Handle constants in projection Resolved Jitendra Nath Pandey  
Add partition support for vectorized ORC Input format Resolved Sarvesh Sakalanaga  
vectorized NotCol operation does not handle short-circuit evaluation for NULL propagation correctly Resolved Jitendra Nath Pandey  
IsNotNull and NotCol incorrectly handle nulls. Resolved Jitendra Nath Pandey  
select * fails on orc table when vectorization is enabled Resolved Sarvesh Sakalanaga  
only explicit int type works e2e. tiny,small, and big all fail with: org.apache.hadoop.hive.ql.metadata.HiveException: Unsuported JIT vectorization column type Resolved Tony Murphy  
Move test utils and fix build to remove false test failures Resolved Tony Murphy  
Run check-style on the branch and fix style issues. Resolved Jitendra Nath Pandey  
VectorizedRowBatchCtx::CreateVectorizedRowBatch should create only the projected columns and not all columns Resolved Sarvesh Sakalanaga  
Speed up vectorized LIKE filter for special cases abc%, %abc and %abc% Resolved Teddy Choi  
Vectorized RecordReader for ORC does not set the ColumnVector.IsRepeating correctly Resolved Sarvesh Sakalanaga  
Column Column, and Column Scalar vectorized execution tests Resolved Tony Murphy  
In place filtering in Not Filter doesn't handle nulls correctly. Resolved Jitendra Nath Pandey  
fix failure to set output isNull to true and other NULL propagation issues; update arithmetic tests Resolved Eric Hanson  
Support strings in GROUP BY keys Resolved Remus Rusanu  
Fix serialization exceptions in VectorGroupByOperator Resolved Remus Rusanu  
Remove test code from ql\src\java tree, place it itn ql\src\test tree Resolved Tony Murphy  
VectorGroupByOperator steals the non-vectorized children and crashes query if vectorization fails Resolved Jitendra Nath Pandey  
Vectorized reader support for timestamp in ORC. Resolved Sarvesh Sakalanaga  
Enable running all hive e2e tests under vectorization Resolved Tony Murphy  
VectorSelectOperator projections change the index of columns for subsequent operators. Resolved Jitendra Nath Pandey  
Cleanup column type dependencies in vectorization aggregate code Open Remus Rusanu
Implement vector group by hash spill Resolved Remus Rusanu  
Support DISTINCT in vectorized aggregates Open Remus Rusanu
Vectorized UDFs for Timestamp in nanoseconds Resolved Gopal V  
Vectorized aggregates do not emit proper rows in presence of GROUP BY Resolved Remus Rusanu  
Improve cache friendliness of VectorHashKeyWrapper Open Remus Rusanu
Integrate Vectorized Substr into Vectorized QE Resolved Eric Hanson  
Fix VectorUDAFSum.txt to honor the expected vector column type Resolved Remus Rusanu  
CommonOrcInputFormat should be the default input format for Orc tables. Resolved Sarvesh Sakalanaga  
Implement vectorized RLIKE and REGEXP filter expressions Resolved Teddy Choi  
Unit test failure in TestColumnScalarOperationVectorExpressionEvaluation Resolved Jitendra Nath Pandey  
TestVectorGroupByOperator causes asserts in StandardStructObjectInspector.init Resolved Remus Rusanu  
VectorHashKeyWrapperBatch.java should be in vector package (instead of exec) Resolved Remus Rusanu  
Favor serde2.io Writable classes over hadoop.io ones Resolved Remus Rusanu  
Remove unused org.apache.hadoop.hive.ql.exec Writables Open Unassigned
Vectorization not working with negative constants, hive doesn't fold constants. Resolved Jitendra Nath Pandey  
Implement vectorized text reader to read vectorized data from Text file Patch Available Sarvesh Sakalanaga
Support Hive specific DISTRIBUTE BY clause in VectorGroupByOperator Open Remus Rusanu
error at VectorExecMapper.close in group-by-agg query over ORC, vectorized Resolved Jitendra Nath Pandey  
Count(*) over tpch lineitem ORC results in Error: Java heap space Resolved Sarvesh Sakalanaga  
tpch query 1 fails with java.lang.ClassCastException Resolved Jitendra Nath Pandey  
wrong results for query with modulo (%) in WHERE clause filter Resolved Sarvesh Sakalanaga  
Use VectorExpessionWriter to write column vectors into Writables. Resolved Jitendra Nath Pandey  
Optimize COUNT(*) aggregate over vectorized ORC execution path Open Unassigned
second clause of AND, OR filter not applied for vectorized execution Resolved Jitendra Nath Pandey  
second clause of OR filter not applied in vectorized query execution Resolved Jitendra Nath Pandey  
Fix ORC TimestampTreeReader.nextVector() to handle milli-nano math corectly Resolved Gopal V  
Query with filter constant on left of "=" and column expression on right does not vectorize Resolved Jitendra Nath Pandey  
query using LIKE does not vectorize Resolved Eric Hanson  
Max on float returning wrong results Resolved Remus Rusanu  
incorrect result for max aggregate over int column Resolved Remus Rusanu  
NPE in writing null values. Resolved Jitendra Nath Pandey  
Unit test failure in TestColumnColumnOperationVectorExpressionEvaluation Resolved Eric Hanson  
Fix ORC TestVectorizedORCReader testcase for Timestamps Resolved Gopal V  
Integrate basic UDFs for Timesamp Resolved Gopal V  
Optimize filter Column IN ( list-of-constants ) for vectorized execution Resolved Unassigned  
Unit test failure TestVectorSelectOperator Resolved Jitendra Nath Pandey  
TestCase FakeVectorRowBatchFromObjectIterables error Resolved Eric Hanson  
Query on Table with partition columns fail with AlreadyBeingCreatedException Resolved Sarvesh Sakalanaga  
Vectorized Sum of scalar subtract column returns negative result when positive exected Resolved Jitendra Nath Pandey  
Classcast exception with two group by keys of types string and tinyint. Resolved Remus Rusanu  
array out of bounds exception near VectorHashKeyWrapper.getBytes() with 2 column GROUP BY Resolved Remus Rusanu  
MIN on timestamp column gives incorrect result. Resolved Gopal V  
Optimize ORC StringTreeReader::nextVector to not create dictionary of strings for each call to nextVector Resolved Sarvesh Sakalanaga  
Float aggregate of single value loses precission Open Remus Rusanu
Unary Minus Expression Throwing java.lang.NullPointerException Resolved Jitendra Nath Pandey  
java.lang.RuntimeException: Hive Runtime Error while closing operators: java.lang.ClassCastException: org.apache.hadoop.io.NullWritable cannot be cast to org.apache.hadoop.hive.serde2.io.DoubleWritable Resolved Jitendra Nath Pandey  
OrcInputFormat should be enhanced to provide vectorized input. Resolved Jitendra Nath Pandey  
NULLs and record separators broken with vectorization branch intermediate outputs Resolved Gopal V  
Vectorized ORC reader does not handle absence of column present stream correctly. Resolved Sarvesh Sakalanaga  
Null Pointer Exception in Group By Operator Resolved Jitendra Nath Pandey  
Hive Runtime Error while closing operators: java.lang.NullPointerException Resolved Remus Rusanu  
Incorrect aggregate results Resolved Remus Rusanu  
make vectorized LOWER(), UPPER(), LENGTH() work end-to-end; support expression input for vectorized LIKE Resolved Eric Hanson  
Unit e2e tests for vectorization Resolved Tony Murphy  
Implement vectorized type casting for all types Resolved Eric Hanson  
implement vectorized math functions Resolved Eric Hanson  
implement vectorized TRIM(), LTRIM(), RTRIM() Resolved Eric Hanson  
Make vectorization branch compile under JDK 7 Resolved Ashutosh Chauhan  
Implement Vectorized Limit Operator Resolved Sarvesh Sakalanaga  
std, stddev and stddev_pop aggregates on double/float fail to vectorize Resolved Remus Rusanu  
Implement vectorized JOIN operators Resolved Remus Rusanu  
String column comparison classes should be renamed. Resolved Jitendra Nath Pandey  
ORC TimestampTreeReader.nextVector() off by a second when time in fractional Resolved Gopal V  
make vectorized math functions work end-to-end (update VectorizationContext.java) Resolved Eric Hanson  
Vectorized ORC reader does not set isRepeating flag correctly when 1’s are present is the input stream Resolved Sarvesh Sakalanaga  
create template for string scalar compared with string column Resolved Eric Hanson  
MAX/MIN aggregates yield incorrect results Resolved Remus Rusanu  
Make RLIKE/REGEXP run end-to-end by updating VectorizationContext Resolved Teddy Choi  
Allow prevention of string column re-use for string functions that can set results by reference Open Unassigned
Vectorized plan generation should be added as an optimization transform. Resolved Jitendra Nath Pandey  
Create bridge for custom UDFs to operate in vectorized mode Resolved Eric Hanson  
Unit test failure in TestVectorTimestampExpressions Resolved Gopal V  
Consolidate and simplify vectorization code and test generation Resolved Tony Murphy  
Make vector expressions serializable. Resolved Jitendra Nath Pandey  
FilterExprOrExpr changes the order of the rows Resolved Jitendra Nath Pandey  
Vector operators should inherit from non-vector operators for code re-use. Resolved Jitendra Nath Pandey  
Enhance explain to indicate vectorized execution of operators. Resolved Jitendra Nath Pandey  
orc_create.q and other orc tests fail on the branch. Resolved Jitendra Nath Pandey  
The code generation should be part of the build process. Resolved Jitendra Nath Pandey  
Update hive-default.xml.template for vectorization flag; remove unused imports from MetaStoreUtils.java Resolved Jitendra Nath Pandey  
Commit vectorization test data, comment/rename vectorization tests. Resolved Tony Murphy  
Boolean constants in the query are not handled correctly. Resolved Jitendra Nath Pandey  
VectorizedRowBatch member variables are public. Reopened Jitendra Nath Pandey
Follow convention for placing modifiers in variable declaration. Open Jitendra Nath Pandey
Avoid catching Throwable and converting them to exceptions. Open Jitendra Nath Pandey
Refactor VectorizationContext and handle NOT expression with nulls. Resolved Jitendra Nath Pandey  
Vectorization throws exception with nested UDF. Resolved Jitendra Nath Pandey  
TopN optimization in VectorReduceSink Resolved Sergey Shelukhin  
Implement end-to-end tests for vectorized string and math functions, and casts Resolved Eric Hanson  
Vectorized query failing for partitioned tables. Resolved Jitendra Nath Pandey  
Handle virtual columns and schema evolution in vector code path Open Matt McCline
Implement vectorized year/month/day... etc. for string arguments Resolved Teddy Choi  
Implement BETWEEN filter in vectorized mode Resolved Eric Hanson  
Implement support for IN (list-of-constants) filter in vectorized mode Resolved Eric Hanson  
Write initial user documentation for vectorized query on Hive Wiki Resolved Eric Hanson  
Exception in vectorized map join. Resolved Jitendra Nath Pandey  
Implement vectorized SMB JOIN Resolved Remus Rusanu


Fix validation of nested expressions. Resolved Jitendra Nath Pandey  
Exception in UDFs with large number of arguments. Resolved Jitendra Nath Pandey  
Vectorized Shuffle Join produces incorrect results Resolved Remus Rusanu  
Supported UDFs should have a separate annotation to indicate they are vectorizable. Open Jitendra Nath Pandey
Validation doesn't catch SMBMapJoin Resolved Jitendra Nath Pandey  
Intermediate columns are incorrectly initialized for partitioned tables. Resolved Jitendra Nath Pandey  
Add unit test for vectorized BETWEEN for timestamp inputs Resolved Eric Hanson  
Implement support for BETWEEN in SELECT list Patch Available Navis
Implement vectorization support for IF conditional expression for long, double, timestamp, boolean and string inputs Resolved Eric Hanson  
Implement vectorized support for CASE Resolved Eric Hanson  
Implement vectorized support for NOT IN filter Resolved Eric Hanson  
Implement vectorized support for COALESCE conditional expression Resolved Jitendra Nath Pandey  
Implement vectorized support for the DATE data type Resolved Teddy Choi  
Implement vectorized support for the DECIMAL data type In Progress Eric Hanson
Implement vectorization support for IF conditional expression for boolean and timestamp inputs Resolved Eric Hanson  
Implement vectorization support for IF conditional expression for string inputs Resolved Eric Hanson  
query fails in vectorized mode on empty partitioned table Open Unassigned
Implement vectorized support for IN as boolean-valued expression Resolved Eric Hanson  
Implement vectorized support for CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END Resolved Unassigned  
Rollups not supported in vector mode. Resolved Jitendra Nath Pandey  
Failure in cast to timestamps. Resolved Jitendra Nath Pandey  
Add vectorized reader for Parquet files Closed Remus Rusanu  
Contribute Decimal128 high-performance decimal(p, s) package from Microsoft to Hive Resolved Eric Hanson  
Create DecimalColumnVector and a representative VectorExpression for decimal Resolved Eric Hanson  
Implement vectorized decimal comparison filters Resolved Eric Hanson  
Support basic Decimal arithmetic in vector mode (+, -, *) Resolved Eric Hanson  
Implement vectorized decimal division and modulo Resolved Eric Hanson  
Implement vectorized reader for Date datatype for ORC format. Resolved Jitendra Nath Pandey  
Implement vectorized reader for DECIMAL datatype for ORC format. Resolved Jitendra Nath Pandey  
Implement vectorized type cast from/to decimal(p, s) Resolved Eric Hanson  
error in vectorized Column-Column comparison filter for repeating case Resolved Eric Hanson  
Make Vector Group By operator abandon grouping if too many distinct keys Resolved Remus Rusanu  
Implement fast vectorized InputFormat extension for text files Open Eric Hanson
error in high-precision division for Decimal128 Resolved Eric Hanson  
Add more unit tests for high-precision Decimal128 arithmetic Resolved Eric Hanson  
VectorExpressionWriter for date and decimal datatypes. Resolved Jitendra Nath Pandey  
Generate vectorized plan for decimal expressions. Resolved Jitendra Nath Pandey  
Add DECIMAL support to vectorized group by operator Resolved Remus Rusanu  
Add DECIMAL support to vectorized JOIN operators Resolved Remus Rusanu  
Column name map is broken Resolved Jitendra Nath Pandey  
Extend the alltypesorc test table to include DECIMAL columns Open Unassigned
Implement vectorized unary minus for decimal Resolved Jitendra Nath Pandey  
bug in high-precision Decimal128 multiply Resolved Eric Hanson  
Vectorized mathematical functions for decimal type. Resolved Jitendra Nath Pandey  
fix bug in UnsignedInt128.multiplyArrays4And4To8 and revert temporary fix in Decimal128.multiplyDestructive Open Jitendra Nath Pandey
Queries fail to Vectorize. Resolved Jitendra Nath Pandey  
Remove unnecessary white spaces in vectorization code Patch Available Teddy Choi


Jitendra Nath Pandey added a comment - 13/Mar/13 19:57

This will be an incremental work in multiple phases with no regression on current system. We will publish a design/scope document very soon.
The main idea behind the proposal is to transform the execution engine to process a row batch at a time instead of a single row. The row batch will consist of column vectors and each operator will process the whole column vector at a time. The column vector will consist of array(s) of primitive types as far as possible.
The expressions will be implemented for various data types using pre-compiled templates. The appropriate expressions will be added to the operators based on data types.
A vectorized iterator interface will be implemented by the file formats to provide vectorized input to the operator tree.
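The vectorized iterator mentioned in this comment might look roughly like the sketch below: a reader fills a reusable row batch on each call to next(), and the consumer processes whole column vectors at a time. The interface, the class names, and the batch size are assumptions made for illustration; the actual interface is defined in the design document and the Hive code, not here.

```java
// Illustrative sketch only: all names here are hypothetical.
final class VectorizedIteratorSketch {
    static final int BATCH_SIZE = 1024; // assumed batch capacity

    /** A reusable batch: one primitive array per column plus a row count. */
    static final class RowBatch {
        final long[][] cols; // cols[column][row]
        int size;            // number of valid rows in this batch

        RowBatch(int numCols) {
            cols = new long[numCols][BATCH_SIZE];
        }
    }

    /** Hypothetical reader contract: fill the batch; return false at end of input. */
    interface BatchReader {
        boolean next(RowBatch batch);
    }

    /** Toy reader over in-memory columnar data, standing in for a file format. */
    static final class ArrayReader implements BatchReader {
        private final long[][] data; // data[column][row]
        private int pos = 0;

        ArrayReader(long[][] data) {
            this.data = data;
        }

        @Override
        public boolean next(RowBatch batch) {
            int total = data.length == 0 ? 0 : data[0].length;
            int n = Math.min(BATCH_SIZE, total - pos);
            if (n <= 0) {
                batch.size = 0;
                return false;
            }
            for (int c = 0; c < data.length; c++) {
                System.arraycopy(data[c], pos, batch.cols[c], 0, n);
            }
            batch.size = n;
            pos += n;
            return true;
        }
    }

    /** Operator-side loop: consume whole batches, summing column 0. */
    static long sumFirstColumn(BatchReader reader, int numCols) {
        RowBatch batch = new RowBatch(numCols); // allocated once, reused per call
        long sum = 0;
        while (reader.next(batch)) {
            long[] v = batch.cols[0];
            for (int i = 0; i < batch.size; i++) {
                sum += v[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        long[][] data = new long[1][2500];
        java.util.Arrays.fill(data[0], 1L);
        // 2500 rows arrive as batches of 1024, 1024, and 452 rows; prints 2500
        System.out.println(sumFirstColumn(new ArrayReader(data), 1));
    }
}
```

Reusing a single batch object across calls avoids per-batch allocation, which is one plausible reason to have the reader fill a caller-supplied batch rather than return a new one.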

Jitendra Nath Pandey added a comment - 13/Mar/13 19:58
Eric Hanson added a comment - 14/Mar/13 00:41

This is part of the Stinger initiative. http://hortonworks.com/blog/100x-faster-hive/

Jitendra Nath Pandey added a comment - 18/Mar/13 23:40

The attached document covers the outline of the design. Any comments/feedback are welcome. We will keep updating the document with more details as we include more data types, operators and expressions. We will also include the vectorized iterator design into the document.

Eric Hanson added a comment - 06/Apr/13 01:00

Added section on requirements for implementation of vectorized iterator, with respect to how to load VectorizedRowBatch object on each call to next().

Steve Loughran added a comment - 10/Apr/13 12:23

We couldn't have a copy of the doc in PDF stuck up at the same time as the editable one could we?

Eric Hanson added a comment - 10/Apr/13 18:22

Fixed a bug in example, plus made minor wording changes in introduction.

Eric Hanson added a comment - 10/Apr/13 18:29

Adding pdf of design doc per request.

Eric Hanson added a comment - 10/Apr/13 18:29

updated version # and date

Eric Hanson added a comment - 09/May/13 22:31

Updated design document with discussion of precise handling and interpretation of all-non-null (noNulls) and all identical (isRepeating) column vectors.

Also included discussion of TIMESTAMP internal vector representation as a long integer number of nanoseconds since the epoch.
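To make the two flags concrete, here is a minimal sketch of how an aggregate could use noNulls and isRepeating to skip work. The flag names come from the comment above; the class layout and the exact semantics (in particular, that a repeating column is fully described by entry 0) are assumptions for illustration, and the design document remains the authoritative description.

```java
// Illustrative sketch only: the class layout is hypothetical.
final class ColumnFlagsSketch {
    static final class LongColumn {
        final long[] vector;
        final boolean[] isNull;
        boolean noNulls = true;      // assumed meaning: no entry is NULL, isNull[] may be ignored
        boolean isRepeating = false; // assumed meaning: every row equals entry 0, including its null-ness

        LongColumn(int capacity) {
            vector = new long[capacity];
            isNull = new boolean[capacity];
        }
    }

    /** Sums the non-NULL values of the first n rows, using the flags as shortcuts. */
    static long sum(LongColumn col, int n) {
        if (n == 0) {
            return 0;
        }
        if (col.isRepeating) {
            // One value stands for all n rows, so no loop is needed.
            boolean entryIsNull = !col.noNulls && col.isNull[0];
            return entryIsNull ? 0 : col.vector[0] * n;
        }
        long total = 0;
        if (col.noNulls) {
            // Fast path: no null check inside the loop.
            for (int i = 0; i < n; i++) {
                total += col.vector[i];
            }
        } else {
            for (int i = 0; i < n; i++) {
                if (!col.isNull[i]) {
                    total += col.vector[i];
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        LongColumn repeated = new LongColumn(1024);
        repeated.vector[0] = 7;
        repeated.isRepeating = true;
        System.out.println(sum(repeated, 1000)); // prints 7000

        LongColumn withNull = new LongColumn(1024);
        withNull.vector[0] = 1;
        withNull.vector[1] = 2;
        withNull.vector[2] = 3;
        withNull.noNulls = false;
        withNull.isNull[1] = true;
        System.out.println(sum(withNull, 3)); // prints 4
    }
}
```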

Eric Hanson added a comment - 10/May/13 00:28

The code for this work is currently in the "vectorization" branch of the public Hive repo.

Eric Hanson added a comment - 13/May/13 18:59

Added discussion of timestamp values before the epoch (in 1970) related to HIVE-4525.
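With timestamps held as a long count of nanoseconds since the epoch, values before 1970 are negative, and splitting them back into seconds and a fractional part needs care. The sketch below shows one way to do this with floor division; it illustrates the arithmetic only and is not a claim about how the design document or HIVE-4525 handles it. All names are hypothetical.

```java
// Illustrative sketch only: one possible treatment of pre-epoch values.
final class TimestampNanosSketch {
    static final long NANOS_PER_SECOND = 1_000_000_000L;

    /** Encodes (seconds since epoch, nanosecond fraction) as a single long. */
    static long toNanos(long epochSeconds, int nanoOfSecond) {
        // The exact-arithmetic helpers throw on overflow instead of wrapping.
        return Math.addExact(
                Math.multiplyExact(epochSeconds, NANOS_PER_SECOND), nanoOfSecond);
    }

    /** Whole seconds, rounded toward negative infinity so pre-epoch values work. */
    static long secondsPart(long nanos) {
        return Math.floorDiv(nanos, NANOS_PER_SECOND);
    }

    /** Fractional part in [0, 999_999_999], also for negative inputs. */
    static int nanosPart(long nanos) {
        return (int) Math.floorMod(nanos, NANOS_PER_SECOND);
    }

    public static void main(String[] args) {
        // Half a second before the epoch: 1969-12-31 23:59:59.5 UTC
        long t = toNanos(-1, 500_000_000);
        System.out.println(t);              // prints -500000000
        System.out.println(secondsPart(t)); // prints -1
        System.out.println(nanosPart(t));   // prints 500000000
    }
}
```

Plain `/` and `%` would give 0 seconds and a negative fraction for the same input, which is the kind of pre-epoch pitfall such a representation has to account for.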

Eric Hanson added a comment - 28/May/13 22:52

Updated design spec with new section by Remus Rusanu about vectorized group-by/aggregate. I edited it a little bit and added the final paragraph on future considerations.

Dmitriy V. Ryaboy added a comment - 03/Jul/13 23:15

Hi folks,
What an incredible amount of work! Looks fantastic, looking forward to this.

It seems like the general idea of a vectorized operator is not Hive-specific. Is there any possibility of abstracting the core logic of an operator that can efficiently process a stream of data, such as what you get from ORCFile, and return the computed results?

Having such a library be available independently of Hive would allow reuse in other Hadoop ecosystem projects (Pig, Cascading, Drill, etc) without the need to reinvent the wheel, and would also bring the whole community behind optimizing one set of operators instead of continuing the existing fragmented state of the world.

The process of separating out such a library might also yield benefits in terms of winding up with a cleaner design and better abstractions (that's been my experience when going through similar exercises on other projects – I don't have any reason to think your current design is not clean or doesn't have good abstractions).

Do you have any thoughts on how this could be achieved? Does this sound like something you would be interested in? Is there something that people currently working on other projects can do to help this become a reality?

Vinod Kumar Vavilapalli added a comment - 04/Jul/13 07:28

A huge +1 to that. Having a common set of operators will be a huge win. That said, I already see that the current branch follows Hive's operator base classes, uses HiveConf etc. I believe with little effort, this can be cleaned and pulled apart into one separate maven module that everyone can use.

Some points to think about:

  • The target location of the module. The dependency graph can become unwieldy.
  • Given the use of the base Operator, OperatorDesc, etc. from Hive, if there is interest and commitment, we should do this ASAP while we only have a handful of operators.
  • Have one other project demonstrate how it can be reused across ecosystem projects. Pig would be great; just a few operators would be a great start.


Eric Hanson added a comment - 08/Jul/13 21:43

Dmitriy and Vinod,

What specifically do you want to do with the code once it is factored out?


Dmitriy V. Ryaboy added a comment - 08/Jul/13 22:05

I would like to provide the same vectorization benefits to Pig and similar frameworks (possibly Cascading, and maybe the Spark or Crunch guys will want to use this as well, etc).

Jitendra Nath Pandey added a comment - 11/Jul/13 18:33

Dmitriy, Vinod
There is a significant amount of vectorization work in expression evaluation, for example arithmetic expressions, logical expressions, and aggregations. Many of these expressions are pretty generic, and different systems are likely to have similar semantics for them. It should be possible to re-use this code with little change in Pig or other systems. Re-using these expressions requires the same vectorized representation of data in the processing engine, but that part of the code is also generic and re-usable. I think that could be a good starting point.
However, a good deal of the vectorization work is in operator code, where we have vectorized versions of the Hive operators. These operators are closely tied to Hive semantics and implementation. Therefore, generalizing them for re-use in other projects will also need some restructuring in the Hive code base. Also, at this point we should be thinking more generally about a common physical layer shared between Pig and Hive. These languages can continue to have different logical plans, but it would be desirable for them to share a common physical plan structure because they both use the same map-reduce runtime.

Dmitriy V. Ryaboy added a comment - 11/Jul/13 20:30

I believe physical plan primitives for both Hive and Pig (and potentially others) are going to come in via Tez, as both Pig and Hive want to get off strict MR in the long-term.

I'll take a crack at extracting what's extractable. Right now Hive's UDAF reaches fairly deeply into this code, as you noted, but I think with a little restructuring this can be factored out.

Eric Hanson added a comment - 17/Sep/13 23:16

Updated design specification with new section describing the vectorized UDF adaptor (HIVE-4961).

Jitendra Nath Pandey added a comment - 01/Oct/13 18:01

Vectorization work has been committed to trunk. Going forward, all vectorization work will happen on trunk, and the vectorization branch will be obsolete.

Lars Francke added a comment - 04/Oct/13 11:26

This is a huge patch and it's hard to see if it changes anything for the end user. As we'd like to keep the Wiki up-to-date it'd be great if someone could comment whether there are any configuration options besides hive.vectorized.execution.enabled or any other things that should be documented.


Eric Hanson added a comment - 04/Oct/13 16:19

I've been planning to write some user documentation for this feature. Where do you think would be a good spot in the wiki to include it?

Lefty Leverenz added a comment - 05/Oct/13 10:47

Put it in Design Docs (https://cwiki.apache.org/confluence/display/Hive/DesignDocs) until it's released. Later you can move it into the User Docs with a note about which release introduces it. You can either change the file's location in the hierarchy or leave it in place and just link to it from the User Docs section.

When it goes into User Docs, you have some choices. Does it belong on the Home page or in the Language Manual? If in the Language Manual, do you want it under DML or should it be a stand-alone doc? That depends on what you write and how you want readers to find the doc. You can always add links from other docs to make sure people find it.

Here's the Language Manual: https://cwiki.apache.org/confluence/display/Hive/LanguageManual.

Of course configuration goes here, perhaps in a subsection under Query Execution: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties. I suggest you make a section in your design doc that's formatted to match the configuration doc, so when the time comes you can just cut & paste.

Eric Hanson added a comment - 01/Nov/13 17:45


  • Assignee:
    Jitendra Nath Pandey
  • Reporter:
    Jitendra Nath Pandey
  • Votes:
    2
  • Watchers:
    53


  • Created:
    13/Mar/13 19:56
  • Updated:
    11/Aug/15 17:29

    Time Tracking

Not Specified