
[HIVE-4160] Vectorized Query Execution in Hive - ASF JIRA

Original source (issues.apache.org)
Tags: vectorized-execution hive
Clipped on: 2016-05-23


Vectorized Query Execution in Hive

    Details

  • Type: New Feature
  • Status: Open
  • Priority: Major
  • Resolution: Unresolved
  • Affects Version/s: None
  • Fix Version/s: None
  • Component/s: None
  • Labels: None

    Description

The Hive query execution engine currently processes one row at a time. A single row of data goes through all the operators before the next row can be processed. This mode of processing is very inefficient in terms of CPU usage. Research has demonstrated that it yields very low instructions per cycle [MonetDB X100]. Hive also currently relies heavily on lazy deserialization, and data columns go through a layer of object inspectors that identify the column type, deserialize the data and determine the appropriate expression routines in the inner loop. These layers of virtual method calls further slow down processing.

This work will add support for vectorized query execution to Hive, where, instead of individual rows, batches of about a thousand rows at a time are processed. Each column in the batch is represented as a vector of a primitive data type. The inner loop of execution scans these vectors very fast, avoiding method calls, deserialization, unnecessary if-then-else, etc. This substantially reduces CPU time used, and gives excellent instructions per cycle (i.e. improved processor pipeline utilization). See the attached design specification for more details.
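As a rough illustration of the difference described above, the following sketch contrasts a row-at-a-time evaluation, where every value is boxed and passes through an interface call, with a vectorized loop over a primitive array. All names here (`VectorizedAddSketch`, `RowExpression`, `addScalar`) are hypothetical and are not taken from the Hive code base or the attached design document; the batch size of 1024 is only an assumption matching "about a thousand rows".

```java
import java.util.Arrays;

/** Illustrative sketch only; these names are hypothetical, not Hive classes. */
final class VectorizedAddSketch {
    // Assumed batch size, chosen to match "about a thousand rows".
    static final int BATCH_SIZE = 1024;

    /** Row-at-a-time style: one boxed row per call, dispatched through an interface. */
    interface RowExpression {
        Object evaluate(Object[] row);
    }

    /** Vectorized style: a tight loop over primitive arrays with no per-row calls. */
    static void addScalar(long[] input, long scalar, long[] output, int size) {
        for (int i = 0; i < size; i++) {
            output[i] = input[i] + scalar;
        }
    }

    public static void main(String[] args) {
        long[] col = new long[BATCH_SIZE];
        for (int i = 0; i < BATCH_SIZE; i++) {
            col[i] = i;
        }

        // Row-at-a-time: each value is boxed and goes through a virtual call.
        RowExpression plusTen = row -> (Long) row[0] + 10L;
        long[] rowResult = new long[BATCH_SIZE];
        for (int i = 0; i < BATCH_SIZE; i++) {
            rowResult[i] = (Long) plusTen.evaluate(new Object[] { col[i] });
        }

        // Vectorized: one call processes the whole column of the batch.
        long[] vecResult = new long[BATCH_SIZE];
        addScalar(col, 10L, vecResult, BATCH_SIZE);

        System.out.println("results match: " + Arrays.equals(rowResult, vecResult));
    }
}
```

Both paths compute the same values; the point of the sketch is only that the second loop touches nothing but primitive arrays, which is the property the description relies on for better pipeline utilization.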

Attachments

  1. Hive-Vectorized-Query-Execution-Design.docx (18/Mar/13 23:40, 33 kB)
  2. Hive-Vectorized-Query-Execution-Design-rev10.docx (25/Jul/13 18:56, 41 kB)
  3. Hive-Vectorized-Query-Execution-Design-rev10.docx (25/Jul/13 18:53, 41 kB)
  4. Hive-Vectorized-Query-Execution-Design-rev10.pdf (25/Jul/13 18:55, 665 kB)
  5. Hive-Vectorized-Query-Execution-Design-rev11.docx (17/Sep/13 23:14, 42 kB)
  6. Hive-Vectorized-Query-Execution-Design-rev11.pdf (17/Sep/13 23:14, 671 kB)
  7. Hive-Vectorized-Query-Execution-Design-rev2.docx (06/Apr/13 01:00, 31 kB)
  8. Hive-Vectorized-Query-Execution-Design-rev3.docx (10/Apr/13 18:29, 32 kB)
  9. Hive-Vectorized-Query-Execution-Design-rev3.docx (10/Apr/13 18:22, 32 kB)
  10. Hive-Vectorized-Query-Execution-Design-rev3.pdf (10/Apr/13 18:29, 596 kB)
  11. Hive-Vectorized-Query-Execution-Design-rev4.docx (26/Apr/13 18:56, 32 kB)
  12. Hive-Vectorized-Query-Execution-Design-rev4.pdf (26/Apr/13 18:57, 596 kB)
  13. Hive-Vectorized-Query-Execution-Design-rev5.docx (09/May/13 22:29, 34 kB)
  14. Hive-Vectorized-Query-Execution-Design-rev5.pdf (09/May/13 22:29, 609 kB)
  15. Hive-Vectorized-Query-Execution-Design-rev6.docx (09/May/13 22:47, 34 kB)
  16. Hive-Vectorized-Query-Execution-Design-rev6.pdf (09/May/13 22:47, 609 kB)
  17. Hive-Vectorized-Query-Execution-Design-rev7.docx (13/May/13 18:59, 35 kB)
  18. Hive-Vectorized-Query-Execution-Design-rev8.docx (28/May/13 22:49, 36 kB)
  19. Hive-Vectorized-Query-Execution-Design-rev8.pdf (28/May/13 22:50, 651 kB)
  20. Hive-Vectorized-Query-Execution-Design-rev9.docx (25/Jun/13 21:22, 39 kB)
  21. Hive-Vectorized-Query-Execution-Design-rev9.pdf (25/Jun/13 21:22, 657 kB)

    Issue Links

incorporates

HIVE-5584 Write initial user documentation for vectorized query on Hive Wiki

  • Resolved

relates to

HIVE-10179 Optimization for SIMD instructions in Hive

  • Open

Sub-Tasks

1.
Implement vectorized logical expressions. Resolved Jitendra Nath Pandey  
 
2.
Implement vectorized column-scalar expressions Resolved Jitendra Nath Pandey  
 
3.
Implement class for vectorized row batch Resolved Eric Hanson  
 
4.
Implement classes for column vectors. Resolved Eric Hanson  
 
5.
Change ORC tree readers to return batches of rows instead of a row Resolved Sarvesh Sakalanaga  
 
6.
Implement Vectorized Column-Column expressions Resolved Jitendra Nath Pandey  
 
7.
Implement Vectorized Scalar-Column expressions Resolved Eric Hanson  
 
8.
Implement vectorized aggregation expressions Resolved Remus Rusanu  
 
9.
Implement vectorized string column-scalar filters Resolved Eric Hanson  
 
10.
Implement vectorized string functions UPPER(), LOWER(), LENGTH() Resolved Eric Hanson  
 
11.
Implement vectorized LIKE filter Resolved Eric Hanson  
 
12.
Vectorized filter and select operators Resolved Jitendra Nath Pandey  
 
13.
Generate vectorized execution plan Resolved Jitendra Nath Pandey  
 
14.
Vectorized expression for unary minus. Resolved Jitendra Nath Pandey  
 
15.
Implement vectorized string concatenation Resolved Eric Hanson  
 
16.
Extend Vector Aggregates to support GROUP BY Resolved Remus Rusanu  
 
17.
Add support for string column type vector aggregates: COUNT, MIN and MAX Resolved Remus Rusanu  
 
18.
Add support for COUNT(*) in vector aggregates Resolved Remus Rusanu  
 
19.
Input format to read vector data from ORC Resolved Jitendra Nath Pandey  
 
20.
Support partitioned tables in vectorized query execution. Resolved Jitendra Nath Pandey  
 
21.
Queries not supported by vectorized code path should fall back to non vector path. Resolved Jitendra Nath Pandey  
 
22.
set isRepeating to false by default in ColumnArithmeticColumn.txt Resolved Eric Hanson  
 
23.
Finish support for modulo (%) operator for vectorized arithmetic Resolved Eric Hanson  
 
24.
Add unit tests for vectorized IS NULL and IS NOT NULL filters Resolved Jitendra Nath Pandey  
 
25.
Extend plan vectorization to cover GroupByOperator Resolved Remus Rusanu  
 
26.
OR, NOT Filter logic can lose an array, and always takes time O(VectorizedRowBatch.DEFAULT_SIZE) Resolved Jitendra Nath Pandey  
 
27.
Improvement in logical expressions and checkstyle fixes. Resolved Jitendra Nath Pandey  
 
28.
remove redundant copy of arithmetic filter unit test testColOpScalarNumericFilterNullAndRepeatingLogic Resolved Eric Hanson  
 
29.
In ORC, add boolean noNulls flag to column stripe metadata Closed Prasanth Jayachandran  
 
30.
Child expressions are not being evaluated hierarchically in a few templates. Resolved Jitendra Nath Pandey  
 
31.
Implement partition support for vectorized query execution Resolved Sarvesh Sakalanaga  
 
32.
Vectorized row batch should be initialized with additional columns to hold intermediate output. Resolved Jitendra Nath Pandey  
 
33.
Template file VectorUDAFAvg.txt missing from public branch; CodeGen.java fails Resolved Remus Rusanu  
 
34.
Input format to read vector data from RC file Resolved Sarvesh Sakalanaga  
 
35.
Implement vectorized filter for string column compared to string column Resolved Eric Hanson  
 
36.
Implement vectorized string substr Resolved Timothy Chen  
 
37.
Integer division should be cast to double. Resolved Jitendra Nath Pandey  
 
38.
Vectorized reader support for Byte Boolean and Timestamp. Resolved Sarvesh Sakalanaga  
 
39.
The vectorized plan is not picking right expression class for string concatenation. Resolved Eric Hanson  
 
40.
Handle constants in projection Resolved Jitendra Nath Pandey  
 
41.
Add partition support for vectorized ORC Input format Resolved Sarvesh Sakalanaga  
 
42.
vectorized NotCol operation does not handle short-circuit evaluation for NULL propagation correctly Resolved Jitendra Nath Pandey  
 
43.
IsNotNull and NotCol incorrectly handle nulls. Resolved Jitendra Nath Pandey  
 
44.
select * fails on orc table when vectorization is enabled Resolved Sarvesh Sakalanaga  
 
45.
only explicit int type works e2e. tiny,small, and big all fail with: org.apache.hadoop.hive.ql.metadata.HiveException: Unsuported JIT vectorization column type Resolved Tony Murphy  
 
46.
Move test utils and fix build to remove false test failures Resolved Tony Murphy  
 
47.
Run check-style on the branch and fix style issues. Resolved Jitendra Nath Pandey  
 
48.
VectorizedRowBatchCtx::CreateVectorizedRowBatch should create only the projected columns and not all columns Resolved Sarvesh Sakalanaga  
 
49.
Speed up vectorized LIKE filter for special cases abc%, %abc and %abc% Resolved Teddy Choi  
 
50.
Vectorized RecordReader for ORC does not set the ColumnVector.IsRepeating correctly Resolved Sarvesh Sakalanaga  
 
51.
Column Column, and Column Scalar vectorized execution tests Resolved Tony Murphy  
 
52.
In place filtering in Not Filter doesn't handle nulls correctly. Resolved Jitendra Nath Pandey  
 
53.
fix failure to set output isNull to true and other NULL propagation issues; update arithmetic tests Resolved Eric Hanson  
 
54.
Support strings in GROUP BY keys Resolved Remus Rusanu  
 
55.
Fix serialization exceptions in VectorGroupByOperator Resolved Remus Rusanu  
 
56.
Remove test code from ql\src\java tree, place it in ql\src\test tree Resolved Tony Murphy  
 
57.
VectorGroupByOperator steals the non-vectorized children and crashes query if vectorization fails Resolved Jitendra Nath Pandey  
 
58.
Vectorized reader support for timestamp in ORC. Resolved Sarvesh Sakalanaga  
 
59.
Enable running all hive e2e tests under vectorization Resolved Tony Murphy  
 
60.
VectorSelectOperator projections change the index of columns for subsequent operators. Resolved Jitendra Nath Pandey  
 
61. Cleanup column type dependencies in vectorization aggregate code Open Remus Rusanu  
 
62.
Implement vector group by hash spill Resolved Remus Rusanu  
 
63. Support DISTINCT in vectorized aggregates Open Remus Rusanu  
 
64.
Vectorized UDFs for Timestamp in nanoseconds Resolved Gopal V  
 
65.
Vectorized aggregates do not emit proper rows in presence of GROUP BY Resolved Remus Rusanu  
 
66. Improve cache friendliness of VectorHashKeyWrapper Open Remus Rusanu  
 
67.
Integrate Vectorized Substr into Vectorized QE Resolved Eric Hanson  
 
68.
Fix VectorUDAFSum.txt to honor the expected vector column type Resolved Remus Rusanu  
 
69.
CommonOrcInputFormat should be the default input format for Orc tables. Resolved Sarvesh Sakalanaga  
 
70.
Implement vectorized RLIKE and REGEXP filter expressions Resolved Teddy Choi  
 
71.
Unit test failure in TestColumnScalarOperationVectorExpressionEvaluation Resolved Jitendra Nath Pandey  
 
72.
TestVectorGroupByOperator causes asserts in StandardStructObjectInspector.init Resolved Remus Rusanu  
 
73.
VectorHashKeyWrapperBatch.java should be in vector package (instead of exec) Resolved Remus Rusanu  
 
74.
Favor serde2.io Writable classes over hadoop.io ones Resolved Remus Rusanu  
 
75. Remove unused org.apache.hadoop.hive.ql.exec Writables Open Unassigned  
 
76.
Vectorization not working with negative constants, hive doesn't fold constants. Resolved Jitendra Nath Pandey  
 
77. Implement vectorized text reader to read vectorized data from Text file Patch Available Sarvesh Sakalanaga  
 
78. Support Hive specific DISTRIBUTE BY clause in VectorGroupByOperator Open Remus Rusanu  
 
79.
error at VectorExecMapper.close in group-by-agg query over ORC, vectorized Resolved Jitendra Nath Pandey  
 
80.
Count(*) over tpch lineitem ORC results in Error: Java heap space Resolved Sarvesh Sakalanaga  
 
81.
tpch query 1 fails with java.lang.ClassCastException Resolved Jitendra Nath Pandey  
 
82.
wrong results for query with modulo (%) in WHERE clause filter Resolved Sarvesh Sakalanaga  
 
83.
Use VectorExpressionWriter to write column vectors into Writables. Resolved Jitendra Nath Pandey  
 
84. Optimize COUNT(*) aggregate over vectorized ORC execution path Open Unassigned  
 
85.
second clause of AND, OR filter not applied for vectorized execution Resolved Jitendra Nath Pandey  
 
86.
second clause of OR filter not applied in vectorized query execution Resolved Jitendra Nath Pandey  
 
87.
Fix ORC TimestampTreeReader.nextVector() to handle milli-nano math correctly Resolved Gopal V  
 
88.
Query with filter constant on left of "=" and column expression on right does not vectorize Resolved Jitendra Nath Pandey  
 
89.
query using LIKE does not vectorize Resolved Eric Hanson  
 
90.
Max on float returning wrong results Resolved Remus Rusanu  
 
91.
incorrect result for max aggregate over int column Resolved Remus Rusanu  
 
92.
NPE in writing null values. Resolved Jitendra Nath Pandey  
 
93.
Unit test failure in TestColumnColumnOperationVectorExpressionEvaluation Resolved Eric Hanson  
 
94.
Fix ORC TestVectorizedORCReader testcase for Timestamps Resolved Gopal V  
 
95.
Integrate basic UDFs for Timestamp Resolved Gopal V  
 
96.
Optimize filter Column IN ( list-of-constants ) for vectorized execution Resolved Unassigned  
 
97.
Unit test failure TestVectorSelectOperator Resolved Jitendra Nath Pandey  
 
98.
TestCase FakeVectorRowBatchFromObjectIterables error Resolved Eric Hanson  
 
99.
Query on Table with partition columns fail with AlreadyBeingCreatedException Resolved Sarvesh Sakalanaga  
 
100.
Vectorized Sum of scalar subtract column returns negative result when positive expected Resolved Jitendra Nath Pandey  
 
101.
Classcast exception with two group by keys of types string and tinyint. Resolved Remus Rusanu  
 
102.
array out of bounds exception near VectorHashKeyWrapper.getBytes() with 2 column GROUP BY Resolved Remus Rusanu  
 
103.
MIN on timestamp column gives incorrect result. Resolved Gopal V  
 
104.
Optimize ORC StringTreeReader::nextVector to not create dictionary of strings for each call to nextVector Resolved Sarvesh Sakalanaga  
 
105. Float aggregate of single value loses precision Open Remus Rusanu  
 
106.
Unary Minus Expression Throwing java.lang.NullPointerException Resolved Jitendra Nath Pandey  
 
107.
java.lang.RuntimeException: Hive Runtime Error while closing operators: java.lang.ClassCastException: org.apache.hadoop.io.NullWritable cannot be cast to org.apache.hadoop.hive.serde2.io.DoubleWritable Resolved Jitendra Nath Pandey  
 
108.
OrcInputFormat should be enhanced to provide vectorized input. Resolved Jitendra Nath Pandey  
 
109.
NULLs and record separators broken with vectorization branch intermediate outputs Resolved Gopal V  
 
110.
Vectorized ORC reader does not handle absence of column present stream correctly. Resolved Sarvesh Sakalanaga  
 
111.
Null Pointer Exception in Group By Operator Resolved Jitendra Nath Pandey  
 
112.
Hive Runtime Error while closing operators: java.lang.NullPointerException Resolved Remus Rusanu  
 
113.
Incorrect aggregate results Resolved Remus Rusanu  
 
114.
make vectorized LOWER(), UPPER(), LENGTH() work end-to-end; support expression input for vectorized LIKE Resolved Eric Hanson  
 
115.
Unit e2e tests for vectorization Resolved Tony Murphy  
 
116.
Implement vectorized type casting for all types Resolved Eric Hanson  
 
117.
implement vectorized math functions Resolved Eric Hanson  
 
118.
implement vectorized TRIM(), LTRIM(), RTRIM() Resolved Eric Hanson  
 
119.
Make vectorization branch compile under JDK 7 Resolved Ashutosh Chauhan  
 
120.
Implement Vectorized Limit Operator Resolved Sarvesh Sakalanaga  
 
121.
std, stddev and stddev_pop aggregates on double/float fail to vectorize Resolved Remus Rusanu  
 
122.
Implement vectorized JOIN operators Resolved Remus Rusanu  
 
123.
String column comparison classes should be renamed. Resolved Jitendra Nath Pandey  
 
124.
ORC TimestampTreeReader.nextVector() off by a second when time in fractional Resolved Gopal V  
 
125.
make vectorized math functions work end-to-end (update VectorizationContext.java) Resolved Eric Hanson  
 
126.
Vectorized ORC reader does not set isRepeating flag correctly when 1’s are present in the input stream Resolved Sarvesh Sakalanaga  
 
127.
create template for string scalar compared with string column Resolved Eric Hanson  
 
128.
MAX/MIN aggregates yield incorrect results Resolved Remus Rusanu  
 
129.
Make RLIKE/REGEXP run end-to-end by updating VectorizationContext Resolved Teddy Choi  
 
130. Allow prevention of string column re-use for string functions that can set results by reference Open Unassigned  
 
131.
Vectorized plan generation should be added as an optimization transform. Resolved Jitendra Nath Pandey  
 
132.
Create bridge for custom UDFs to operate in vectorized mode Resolved Eric Hanson  
 
133.
Unit test failure in TestVectorTimestampExpressions Resolved Gopal V  
 
134.
Consolidate and simplify vectorization code and test generation Resolved Tony Murphy  
 
135.
Make vector expressions serializable. Resolved Jitendra Nath Pandey  
 
136.
FilterExprOrExpr changes the order of the rows Resolved Jitendra Nath Pandey  
 
137.
Vector operators should inherit from non-vector operators for code re-use. Resolved Jitendra Nath Pandey  
 
138.
Enhance explain to indicate vectorized execution of operators. Resolved Jitendra Nath Pandey  
 
139.
orc_create.q and other orc tests fail on the branch. Resolved Jitendra Nath Pandey  
 
140.
The code generation should be part of the build process. Resolved Jitendra Nath Pandey  
 
141.
Update hive-default.xml.template for vectorization flag; remove unused imports from MetaStoreUtils.java Resolved Jitendra Nath Pandey  
 
142.
Commit vectorization test data, comment/rename vectorization tests. Resolved Tony Murphy  
 
143.
Boolean constants in the query are not handled correctly. Resolved Jitendra Nath Pandey  
 
144. VectorizedRowBatch member variables are public. Reopened Jitendra Nath Pandey  
 
145. Follow convention for placing modifiers in variable declaration. Open Jitendra Nath Pandey  
 
146. Avoid catching Throwable and converting them to exceptions. Open Jitendra Nath Pandey  
 
147.
Refactor VectorizationContext and handle NOT expression with nulls. Resolved Jitendra Nath Pandey  
 
148.
Vectorization throws exception with nested UDF. Resolved Jitendra Nath Pandey  
 
149.
TopN optimization in VectorReduceSink Resolved Sergey Shelukhin  
 
150.
Implement end-to-end tests for vectorized string and math functions, and casts Resolved Eric Hanson  
 
151.
Vectorized query failing for partitioned tables. Resolved Jitendra Nath Pandey  
 
152. Handle virtual columns and schema evolution in vector code path Open Matt McCline  
 
153.
Implement vectorized year/month/day... etc. for string arguments Resolved Teddy Choi  
 
154.
Implement BETWEEN filter in vectorized mode Resolved Eric Hanson  
 
155.
Implement support for IN (list-of-constants) filter in vectorized mode Resolved Eric Hanson  
 
156.
Write initial user documentation for vectorized query on Hive Wiki Resolved Eric Hanson  
 
157.
Exception in vectorized map join. Resolved Jitendra Nath Pandey  
 
158.
Implement vectorized SMB JOIN Resolved Remus Rusanu

 
159.
Fix validation of nested expressions. Resolved Jitendra Nath Pandey  
 
160.
Exception in UDFs with large number of arguments. Resolved Jitendra Nath Pandey  
 
161.
Vectorized Shuffle Join produces incorrect results Resolved Remus Rusanu  
 
162. Supported UDFs should have a separate annotation to indicate they are vectorizable. Open Jitendra Nath Pandey  
 
163.
Validation doesn't catch SMBMapJoin Resolved Jitendra Nath Pandey  
 
164.
Intermediate columns are incorrectly initialized for partitioned tables. Resolved Jitendra Nath Pandey  
 
165.
Add unit test for vectorized BETWEEN for timestamp inputs Resolved Eric Hanson  
 
166. Implement support for BETWEEN in SELECT list Patch Available Navis  
 
167.
Implement vectorization support for IF conditional expression for long, double, timestamp, boolean and string inputs Resolved Eric Hanson  
 
168.
Implement vectorized support for CASE Resolved Eric Hanson  
 
169.
Implement vectorized support for NOT IN filter Resolved Eric Hanson  
 
170.
Implement vectorized support for COALESCE conditional expression Resolved Jitendra Nath Pandey  
 
171.
Implement vectorized support for the DATE data type Resolved Teddy Choi  
 
172. Implement vectorized support for the DECIMAL data type In Progress Eric Hanson  
 
173.
Implement vectorization support for IF conditional expression for boolean and timestamp inputs Resolved Eric Hanson  
 
174.
Implement vectorization support for IF conditional expression for string inputs Resolved Eric Hanson  
 
175. query fails in vectorized mode on empty partitioned table Open Unassigned  
 
176.
Implement vectorized support for IN as boolean-valued expression Resolved Eric Hanson  
 
177.
Implement vectorized support for CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END Resolved Unassigned  
 
178.
Rollups not supported in vector mode. Resolved Jitendra Nath Pandey  
 
179.
Failure in cast to timestamps. Resolved Jitendra Nath Pandey  
 
180.
Add vectorized reader for Parquet files Closed Remus Rusanu  
 
181.
Contribute Decimal128 high-performance decimal(p, s) package from Microsoft to Hive Resolved Eric Hanson  
 
182.
Create DecimalColumnVector and a representative VectorExpression for decimal Resolved Eric Hanson  
 
183.
Implement vectorized decimal comparison filters Resolved Eric Hanson  
 
184.
Support basic Decimal arithmetic in vector mode (+, -, *) Resolved Eric Hanson  
 
185.
Implement vectorized decimal division and modulo Resolved Eric Hanson  
 
186.
Implement vectorized reader for Date datatype for ORC format. Resolved Jitendra Nath Pandey  
 
187.
Implement vectorized reader for DECIMAL datatype for ORC format. Resolved Jitendra Nath Pandey  
 
188.
Implement vectorized type cast from/to decimal(p, s) Resolved Eric Hanson  
 
189.
error in vectorized Column-Column comparison filter for repeating case Resolved Eric Hanson  
 
190.
Make Vector Group By operator abandon grouping if too many distinct keys Resolved Remus Rusanu  
 
191. Implement fast vectorized InputFormat extension for text files Open Eric Hanson  
 
192.
error in high-precision division for Decimal128 Resolved Eric Hanson  
 
193.
Add more unit tests for high-precision Decimal128 arithmetic Resolved Eric Hanson  
 
194.
VectorExpressionWriter for date and decimal datatypes. Resolved Jitendra Nath Pandey  
 
195.
Generate vectorized plan for decimal expressions. Resolved Jitendra Nath Pandey  
 
196.
Add DECIMAL support to vectorized group by operator Resolved Remus Rusanu  
 
197.
Add DECIMAL support to vectorized JOIN operators Resolved Remus Rusanu  
 
198.
Column name map is broken Resolved Jitendra Nath Pandey  
 
199. Extend the alltypesorc test table to include DECIMAL columns Open Unassigned  
 
200.
Implement vectorized unary minus for decimal Resolved Jitendra Nath Pandey  
 
201.
bug in high-precision Decimal128 multiply Resolved Eric Hanson  
 
202.
Vectorized mathematical functions for decimal type. Resolved Jitendra Nath Pandey  
 
203. fix bug in UnsignedInt128.multiplyArrays4And4To8 and revert temporary fix in Decimal128.multiplyDestructive Open Jitendra Nath Pandey  
 
204.
Queries fail to Vectorize. Resolved Jitendra Nath Pandey  
 
205. Remove unnecessary white spaces in vectorization code Patch Available Teddy Choi  
 

    Activity

Jitendra Nath Pandey added a comment - 13/Mar/13 19:57

This will be an incremental work in multiple phases with no regression on current system. We will publish a design/scope document very soon.
The main idea behind the proposal is to transform the execution engine to process a row batch at a time instead of a single row. The row batch will consist of column vectors and each operator will process the whole column vector at a time. The column vector will consist of array(s) of primitive types as far as possible.
The expressions will be implemented for various data types using pre-compiled templates. The appropriate expressions will be added to the operators based on data types.
A vectorized iterator interface will be implemented by the file formats to provide vectorized input to the operator tree.
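The row batch and column vector idea in this comment can be sketched as follows. This is a simplified, hypothetical model written for illustration: `RowBatchSketch`, `LongColumn`, `RowBatch` and `filterLongColGreaterScalar` are assumed names, and the `selected` index array is one possible way to let a filter narrow a batch in place without copying column data. The filter stands in for a single type-specific expression of the kind the comment says would be produced from templates.

```java
/** Illustrative sketch only; a simplified stand-in for a row batch of column vectors. */
final class RowBatchSketch {

    /** A column of the batch, stored as a primitive array. */
    static final class LongColumn {
        final long[] vector;

        LongColumn(int capacity) {
            vector = new long[capacity];
        }
    }

    /** A batch of rows, stored column by column. */
    static final class RowBatch {
        final LongColumn[] cols;
        final int[] selected;   // indices of rows that are still live
        boolean selectedInUse;  // false means rows 0..size-1 are all live
        int size;               // number of live rows

        RowBatch(int numCols, int capacity) {
            cols = new LongColumn[numCols];
            for (int c = 0; c < numCols; c++) {
                cols[c] = new LongColumn(capacity);
            }
            selected = new int[capacity];
        }
    }

    /** Filter "column > scalar" in place: surviving row indices are kept in selected. */
    static void filterLongColGreaterScalar(RowBatch batch, int colNum, long scalar) {
        long[] v = batch.cols[colNum].vector;
        int[] sel = batch.selected;
        int newSize = 0;
        if (batch.selectedInUse) {
            // Only rows that survived earlier filters are examined.
            for (int j = 0; j < batch.size; j++) {
                int i = sel[j];
                if (v[i] > scalar) {
                    sel[newSize++] = i;
                }
            }
        } else {
            for (int i = 0; i < batch.size; i++) {
                if (v[i] > scalar) {
                    sel[newSize++] = i;
                }
            }
            batch.selectedInUse = true;
        }
        batch.size = newSize;
    }

    public static void main(String[] args) {
        RowBatch batch = new RowBatch(1, 4);
        long[] v = batch.cols[0].vector;
        v[0] = 5; v[1] = 1; v[2] = 9; v[3] = 3;
        batch.size = 4;

        filterLongColGreaterScalar(batch, 0, 2L);
        System.out.println("rows left after filter: " + batch.size);
    }
}
```

Chaining a second filter on the same batch reuses the `selected` array, so each operator in the tree only visits rows that earlier operators kept.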

.
Eric Hanson added a comment - 14/Mar/13 00:41

This is part of the Stinger initiative. http://hortonworks.com/blog/100x-faster-hive/

.
Jitendra Nath Pandey added a comment - 18/Mar/13 23:40

The attached document covers the outline of the design. Any comments/feedback are welcome. We will keep updating the document with more details as we include more data types, operators and expressions. We will also include the vectorized iterator design into the document.

.
Eric Hanson added a comment - 06/Apr/13 01:00

Added section on requirements for implementation of vectorized iterator, with respect to how to load VectorizedRowBatch object on each call to next().

.
Steve Loughran added a comment - 10/Apr/13 12:23

We couldn't have a copy of the doc in PDF stuck up at the same time as the editable one could we?

.
Eric Hanson added a comment - 10/Apr/13 18:22

Fixed a bug in example, plus made minor wording changes in introduction.

.
Eric Hanson added a comment - 10/Apr/13 18:29

Adding pdf of design doc per request.

.
Eric Hanson added a comment - 10/Apr/13 18:29

updated version # and date

.
Eric Hanson added a comment - 09/May/13 22:31

Updated design document with discussion of precise handling and interpretation of all-non-null (noNulls) and all identical (isRepeating) column vectors.

Also included discussion of TIMESTAMP internal vector representation as long integer number of nanoseconds since the epoch.
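To make the two flags concrete, here is a minimal sketch of how an expression might use them. It assumes one plausible interpretation, which the design document defines precisely and which is not reproduced here: when `isRepeating` is set only entry 0 of the column is meaningful, and when `noNulls` is set the `isNull` array does not need to be consulted. The class and method names are hypothetical.

```java
/** Illustrative sketch only; the flag semantics here are an assumed interpretation. */
final class ColumnFlagsSketch {

    static final class LongColumn {
        final long[] vector;
        final boolean[] isNull;
        boolean noNulls = true;       // assumed: true means isNull[] can be ignored
        boolean isRepeating = false;  // assumed: true means entry 0 stands for every row

        LongColumn(int capacity) {
            vector = new long[capacity];
            isNull = new boolean[capacity];
        }
    }

    /** Adds a scalar to a column, taking a fast path for each flag. */
    static void addScalar(LongColumn in, long scalar, LongColumn out, int size) {
        out.noNulls = in.noNulls;
        out.isRepeating = in.isRepeating;
        if (in.isRepeating) {
            // A single evaluation covers the whole batch.
            out.isNull[0] = !in.noNulls && in.isNull[0];
            if (!out.isNull[0]) {
                out.vector[0] = in.vector[0] + scalar;
            }
            return;
        }
        if (in.noNulls) {
            // No null checks in the inner loop.
            for (int i = 0; i < size; i++) {
                out.vector[i] = in.vector[i] + scalar;
            }
        } else {
            for (int i = 0; i < size; i++) {
                out.isNull[i] = in.isNull[i];
                if (!in.isNull[i]) {
                    out.vector[i] = in.vector[i] + scalar;
                }
            }
        }
    }

    /** Reads the null indicator of a logical row, honoring both flags. */
    static boolean isNullAt(LongColumn c, int row) {
        return !c.noNulls && c.isNull[c.isRepeating ? 0 : row];
    }

    /** Reads the value of a logical row, honoring isRepeating. */
    static long valueAt(LongColumn c, int row) {
        return c.vector[c.isRepeating ? 0 : row];
    }

    public static void main(String[] args) {
        LongColumn in = new LongColumn(1024);
        LongColumn out = new LongColumn(1024);
        in.isRepeating = true;
        in.vector[0] = 7;
        addScalar(in, 1L, out, 1024);
        System.out.println("row 500 = " + valueAt(out, 500));
    }
}
```

The repeating case is where the saving is largest: a constant or run-length-encoded column costs one addition per batch instead of one per row.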

.
Eric Hanson added a comment - 10/May/13 00:28

The code for this work is currently in the "vectorization" branch of the public Hive repo.

.
Eric Hanson added a comment - 13/May/13 18:59

Added discussion of timestamp values before the epoch (in 1970) related to HIVE-4525.
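The details of HIVE-4525 are not reproduced on this page, so the following is only a sketch of a general pitfall with pre-epoch values when a timestamp is stored as a signed count of nanoseconds. With truncating division a value half a second before the epoch splits into 0 seconds and a negative fraction; floor-based arithmetic keeps the fractional part non-negative. The class and method names are hypothetical; `Math.floorDiv` and `Math.floorMod` are standard Java methods.

```java
/** Illustrative sketch only; shows floor-based splitting of nanoseconds since the epoch. */
final class TimestampNanosSketch {
    static final long NANOS_PER_SECOND = 1_000_000_000L;

    /** Packs whole seconds and a non-negative fraction into nanoseconds since the epoch. */
    static long toNanos(long seconds, int nanoOfSecond) {
        return seconds * NANOS_PER_SECOND + nanoOfSecond;
    }

    /** Whole seconds since the epoch, rounding toward negative infinity. */
    static long seconds(long nanos) {
        return Math.floorDiv(nanos, NANOS_PER_SECOND);
    }

    /** Fraction of the second in nanoseconds, always in the range 0..999,999,999. */
    static int nanoOfSecond(long nanos) {
        return (int) Math.floorMod(nanos, NANOS_PER_SECOND);
    }

    public static void main(String[] args) {
        long halfSecondBeforeEpoch = -500_000_000L;

        // Truncating operators split the value into 0 seconds and a negative fraction.
        System.out.println("truncating: " + (halfSecondBeforeEpoch / NANOS_PER_SECOND)
                + " s, " + (halfSecondBeforeEpoch % NANOS_PER_SECOND) + " ns");

        // Floor-based operators give the second before the epoch plus a positive fraction.
        System.out.println("floor:      " + seconds(halfSecondBeforeEpoch)
                + " s, " + nanoOfSecond(halfSecondBeforeEpoch) + " ns");
    }
}
```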

.
Eric Hanson added a comment - 28/May/13 22:52

Updated design spec with new section by Remus Rusanu about vectorized group-by/aggregate. I edited it a little bit and added the final paragraph on future considerations.

.
Dmitriy V. Ryaboy added a comment - 03/Jul/13 23:15

Hi folks,
What an incredible amount of work! Looks fantastic, looking forward to this.

It seems like the general idea of a vectorized operator is not Hive-specific. Is there any possibility of abstracting the core logic of an operator that can efficiently process a stream of data, such as what you get from ORCFile, and return the computed results?

Having such a library be available independently of Hive would allow reuse in other Hadoop ecosystem projects (Pig, Cascading, Drill, etc) without the need to reinvent the wheel, and would also bring the whole community behind optimizing one set of operators instead of continuing the existing fragmented state of the world.

The process of separating out such a library might also yield benefits in terms of winding up with a cleaner design and better abstractions (that's been my experience when going through similar exercises on other projects – I don't have any reason to think your current design is not clean or doesn't have good abstractions).

Do you have any thoughts on how this could be achieved? Does this sound like something you would be interested in? Is there something that people currently working on other projects can do to help this become a reality?

.
Vinod Kumar Vavilapalli added a comment - 04/Jul/13 07:28

A huge +1 to that. Having a common set of operators will be a huge win. That said, I already see that the current branch follows Hive's operator base classes, uses HiveConf etc. I believe with little effort, this can be cleaned and pulled apart into one separate maven module that everyone can use.

Some points to think about:

  • The target location of the module. The dependency graph can become unwieldy.
  • Given the use of base Operator, OperatorDesc etc from Hive, if at all there is interest and commitment, we should do this ASAP when we only have a handful of operators.
  • Make one other project demonstrate how it can be reused across ecosystem projects, PIG will be great - just a few operators will be a great start

Thoughts?

.
Eric Hanson added a comment - 08/Jul/13 21:43

Dmitriy and Vinod,

What specifically do you want to do with the code once it is factored out?

Eric

.
Dmitriy V. Ryaboy added a comment - 08/Jul/13 22:05

I would like to provide the same vectorization benefits to Pig and similar frameworks (possibly Cascading, and maybe the Spark or Crunch guys will want to use this as well, etc).

.
Jitendra Nath Pandey added a comment - 11/Jul/13 18:33

Dmitriy, Vinod,
There is a significant amount of vectorization work in expression evaluation, for example arithmetic expressions, logical expressions and aggregations. Many of these expressions are pretty generic, and different systems are likely to have similar semantics for them. It should be possible to re-use this code with little change in Pig or other systems. Re-using these expressions will require the same vectorized representation of data in the processing engine, but that part of the code is also generic and re-usable. I think that could be a good starting point.
However, a bunch of the vectorization work is in operator code, where we have vectorized versions of the Hive operators. These operators are closely tied to Hive semantics and implementation. Therefore, it will need some restructuring in the Hive code base as well to generalize these operators for re-use in other projects. Also, at this point we should be thinking more generally about a common physical layer shared between Pig and Hive. These languages can continue to have different logical plans, but it would be desirable for them to share a common physical plan structure because they both use the same map-reduce runtime.

.
Dmitriy V. Ryaboy added a comment - 11/Jul/13 20:30

Jitendra,
I believe physical plan primitives for both Hive and Pig (and potentially others) are going to come in via Tez, as both Pig and Hive want to get off strict MR in the long-term.

I'll take a crack at extracting what's extractable. Right now Hive's UDAF reaches fairly deeply into this code, as you noted, but I think with a little restructuring this can be factored out.

.
Eric Hanson added a comment - 17/Sep/13 23:16

Updated design specification with new section describing the vectorized UDF adaptor (HIVE-4961).

.
Jitendra Nath Pandey added a comment - 01/Oct/13 18:01

Vectorization work has been committed to trunk. Going forward, all the vectorization work will happen on trunk and vectorization branch will be obsolete.

.
Lars Francke added a comment - 04/Oct/13 11:26

This is a huge patch and it's hard to see if it changes anything for the end user. As we'd like to keep the Wiki up-to-date it'd be great if someone could comment whether there are any configuration options besides hive.vectorized.execution.enabled or any other things that should be documented.

Thanks!

.
Eric Hanson added a comment - 04/Oct/13 16:19

I've been planning to write some user documentation for this feature. Where do you think would be a good spot in the wiki to include it?

.
Lefty Leverenz added a comment - 05/Oct/13 10:47

Put it in Design Docs (https://cwiki.apache.org/confluence/display/Hive/DesignDocs) until it's released. Later you can move it into the User Docs with a note about which release introduces it. You can either change the file's location in the hierarchy or leave it in place and just link to it from the User Docs section.

When it goes into User Docs, you have some choices. Does it belong on the Home page or in the Language Manual? If in the Language Manual, do you want it under DML or should it be a stand-alone doc? That depends on what you write and how you want readers to find the doc. You can always add links from other docs to make sure people find it.

Here's the Language Manual: https://cwiki.apache.org/confluence/display/Hive/LanguageManual.

Of course configuration goes here, perhaps in a subsection under Query Execution: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties. I suggest you make a section in your design doc that's formatted to match the configuration doc, so when the time comes you can just cut & paste.

.

    People

  • Assignee: Jitendra Nath Pandey
  • Reporter: Jitendra Nath Pandey
  • Votes: 2
  • Watchers: 53

    Dates

  • Created: 13/Mar/13 19:56
  • Updated: 11/Aug/15 17:29

    Time Tracking

  • Estimated: 168h
  • Remaining: 168h
  • Logged: Not Specified