Jay Taylor's notes

[HIVE-4160] Vectorized Query Execution in Hive - ASF JIRA

Original source (issues.apache.org)

Tags: vectorized-execution hive issues.apache.org

Clipped on: 2016-05-23

Vectorized Query Execution in Hive

Details

Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

The Hive query execution engine currently processes one row at a time: a single row passes through all the operators before the next row is processed. This mode of processing uses the CPU inefficiently; research has shown that it yields very low instructions per cycle [MonetDB X100]. Hive also relies heavily on lazy deserialization, and data columns pass through a layer of object inspectors that identify the column type, deserialize the data, and determine the appropriate expression routines in the inner loop. These layers of virtual method calls slow processing further.

This work will add support for vectorized query execution to Hive, where, instead of individual rows, batches of about a thousand rows at a time are processed. Each column in the batch is represented as a vector of a primitive data type. The inner loop of execution scans these vectors very fast, avoiding method calls, deserialization, unnecessary if-then-else, etc. This substantially reduces CPU time used, and gives excellent instructions per cycle (i.e. improved processor pipeline utilization). See the attached design specification for more details.
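The contrast described above can be illustrated with a minimal Java sketch. It is not Hive's code: the class, interface, and method names (VectorizedAddSketch, RowExpression, evalRowAtATime, addScalar) are hypothetical, and the batch size of 1024 is an assumption standing in for "about a thousand rows".

```java
/** Illustration only: all names here are hypothetical, not Hive's API. */
final class VectorizedAddSketch {
    // Assumed batch size; the description only says "about a thousand rows".
    static final int BATCH_SIZE = 1024;

    /** Row-at-a-time style: one interface call and boxed values per row. */
    interface RowExpression {
        Object evaluate(Object[] row);
    }

    static Object[] evalRowAtATime(Object[][] rows, RowExpression expr) {
        Object[] out = new Object[rows.length];
        for (int r = 0; r < rows.length; r++) {
            out[r] = expr.evaluate(rows[r]); // method call inside the inner loop
        }
        return out;
    }

    /** Vectorized style: the column is a long[] and the loop body is plain arithmetic. */
    static void addScalar(long[] in, long scalar, long[] out, int n) {
        for (int i = 0; i < n; i++) {
            out[i] = in[i] + scalar;
        }
    }

    public static void main(String[] args) {
        long[] col = new long[BATCH_SIZE];
        for (int i = 0; i < BATCH_SIZE; i++) {
            col[i] = i;
        }
        long[] out = new long[BATCH_SIZE];
        addScalar(col, 10L, out, BATCH_SIZE);
        System.out.println(out[0] + " " + out[BATCH_SIZE - 1]);
    }
}
```

The vectorized loop touches only primitive arrays, so there is no per-row deserialization, boxing, or type dispatch; the data type is decided once, when the expression is chosen, rather than once per row.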

Attachments

Hive-Vectorized-Query-Execution-Design.docx (18/Mar/13 23:40, 33 kB)
Hive-Vectorized-Query-Execution-Design-rev10.docx (25/Jul/13 18:56, 41 kB)
Hive-Vectorized-Query-Execution-Design-rev10.docx (25/Jul/13 18:53, 41 kB)
Hive-Vectorized-Query-Execution-Design-rev10.pdf (25/Jul/13 18:55, 665 kB)
Hive-Vectorized-Query-Execution-Design-rev11.docx (17/Sep/13 23:14, 42 kB)
Hive-Vectorized-Query-Execution-Design-rev11.pdf (17/Sep/13 23:14, 671 kB)
Hive-Vectorized-Query-Execution-Design-rev2.docx (06/Apr/13 01:00, 31 kB)
Hive-Vectorized-Query-Execution-Design-rev3.docx (10/Apr/13 18:29, 32 kB)
Hive-Vectorized-Query-Execution-Design-rev3.docx (10/Apr/13 18:22, 32 kB)
Hive-Vectorized-Query-Execution-Design-rev3.pdf (10/Apr/13 18:29, 596 kB)
Hive-Vectorized-Query-Execution-Design-rev4.docx (26/Apr/13 18:56, 32 kB)
Hive-Vectorized-Query-Execution-Design-rev4.pdf (26/Apr/13 18:57, 596 kB)
Hive-Vectorized-Query-Execution-Design-rev5.docx (09/May/13 22:29, 34 kB)
Hive-Vectorized-Query-Execution-Design-rev5.pdf (09/May/13 22:29, 609 kB)
Hive-Vectorized-Query-Execution-Design-rev6.docx (09/May/13 22:47, 34 kB)
Hive-Vectorized-Query-Execution-Design-rev6.pdf (09/May/13 22:47, 609 kB)
Hive-Vectorized-Query-Execution-Design-rev7.docx (13/May/13 18:59, 35 kB)
Hive-Vectorized-Query-Execution-Design-rev8.docx (28/May/13 22:49, 36 kB)
Hive-Vectorized-Query-Execution-Design-rev8.pdf (28/May/13 22:50, 651 kB)
Hive-Vectorized-Query-Execution-Design-rev9.docx (25/Jun/13 21:22, 39 kB)
Hive-Vectorized-Query-Execution-Design-rev9.pdf (25/Jun/13 21:22, 657 kB)

Issue Links

incorporates

HIVE-5584 Write initial user documentation for vectorized query on Hive Wiki

Resolved

relates to

HIVE-10179 Optimization for SIMD instructions in Hive

Open

Sub-Tasks

1.

Implement vectorized logical expressions.

Resolved

Jitendra Nath Pandey

2.

Implement vectorized column-scalar expressions

Resolved

Jitendra Nath Pandey

3.

Implement class for vectorized row batch

Resolved

Eric Hanson

4.

Implement classes for column vectors.

Resolved

Eric Hanson

5.

Change ORC tree readers to return batches of rows instead of a row

Resolved

Sarvesh Sakalanaga

6.

Implement Vectorized Column-Column expressions

Resolved

Jitendra Nath Pandey

7.

Implement Vectorized Scalar-Column expressions

Resolved

Eric Hanson

8.

Implement vectorized aggregation expressions

Resolved

Remus Rusanu

9.

Implement vectorized string column-scalar filters

Resolved

Eric Hanson

10.

Implement vectorized string functions UPPER(), LOWER(), LENGTH()

Resolved

Eric Hanson

11.

Implement vectorized LIKE filter

Resolved

Eric Hanson

12.

Vectorized filter and select operators

Resolved

Jitendra Nath Pandey

13.

Generate vectorized execution plan

Resolved

Jitendra Nath Pandey

14.

Vectorized expression for unary minus.

Resolved

Jitendra Nath Pandey

15.

Implement vectorized string concatenation

Resolved

Eric Hanson

16.

Extend Vector Aggregates to support GROUP BY

Resolved

Remus Rusanu

17.

Add support for string column type vector aggregates: COUNT, MIN and MAX

Resolved

Remus Rusanu

18.

Add support for COUNT(*) in vector aggregates

Resolved

Remus Rusanu

19.

Input format to read vector data from ORC

Resolved

Jitendra Nath Pandey

20.

Support partitioned tables in vectorized query execution.

Resolved

Jitendra Nath Pandey

21.

Queries not supported by vectorized code path should fall back to non vector path.

Resolved

Jitendra Nath Pandey

22.

set isRepeating to false by default in ColumnArithmeticColumn.txt

Resolved

Eric Hanson

23.

Finish support for modulo (%) operator for vectorized arithmetic

Resolved

Eric Hanson

24.

Add unit tests for vectorized IS NULL and IS NOT NULL filters

Resolved

Jitendra Nath Pandey

25.

Extend plan vectorization to cover GroupByOperator

Resolved

Remus Rusanu

26.

OR, NOT Filter logic can lose an array, and always takes time O(VectorizedRowBatch.DEFAULT_SIZE)

Resolved

Jitendra Nath Pandey

27.

Improvement in logical expressions and checkstyle fixes.

Resolved

Jitendra Nath Pandey

28.

remove redundant copy of arithmetic filter unit test testColOpScalarNumericFilterNullAndRepeatingLogic

Resolved

Eric Hanson

29.

In ORC, add boolean noNulls flag to column stripe metadata

Closed

Prasanth Jayachandran

30.

Child expressions are not being evaluated hierarchically in a few templates.

Resolved

Jitendra Nath Pandey

31.

Implement partition support for vectorized query execution

Resolved

Sarvesh Sakalanaga

32.

Vectorized row batch should be initialized with additional columns to hold intermediate output.

Resolved

Jitendra Nath Pandey

33.

Template file VectorUDAFAvg.txt missing from public branch; CodeGen.java fails

Resolved

Remus Rusanu

34.

Input format to read vector data from RC file

Resolved

Sarvesh Sakalanaga

35.

Implement vectorized filter for string column compared to string column

Resolved

Eric Hanson

36.

Implement vectorized string substr

Resolved

Timothy Chen

37.

Integer division should be cast to double.

Resolved

Jitendra Nath Pandey

38.

Vectorized reader support for Byte Boolean and Timestamp.

Resolved

Sarvesh Sakalanaga

39.

The vectorized plan is not picking right expression class for string concatenation.

Resolved

Eric Hanson

40.

Handle constants in projection

Resolved

Jitendra Nath Pandey

41.

Add partition support for vectorized ORC Input format

Resolved

Sarvesh Sakalanaga

42.

vectorized NotCol operation does not handle short-circuit evaluation for NULL propagation correctly

Resolved

Jitendra Nath Pandey

43.

IsNotNull and NotCol incorrectly handle nulls.

Resolved

Jitendra Nath Pandey

44.

select * fails on orc table when vectorization is enabled

Resolved

Sarvesh Sakalanaga

45.

only explicit int type works e2e. tiny,small, and big all fail with: org.apache.hadoop.hive.ql.metadata.HiveException: Unsuported JIT vectorization column type

Resolved

Tony Murphy

46.

Move test utils and fix build to remove false test failures

Resolved

Tony Murphy

47.

Run check-style on the branch and fix style issues.

Resolved

Jitendra Nath Pandey

48.

VectorizedRowBatchCtx::CreateVectorizedRowBatch should create only the projected columns and not all columns

Resolved

Sarvesh Sakalanaga

49.

Speed up vectorized LIKE filter for special cases abc%, %abc and %abc%

Resolved

Teddy Choi

50.

Vectorized RecordReader for ORC does not set the ColumnVector.IsRepeating correctly

Resolved

Sarvesh Sakalanaga

51.

Column Column, and Column Scalar vectorized execution tests

Resolved

Tony Murphy

52.

In place filtering in Not Filter doesn't handle nulls correctly.

Resolved

Jitendra Nath Pandey

53.

fix failure to set output isNull to true and other NULL propagation issues; update arithmetic tests

Resolved

Eric Hanson

54.

Support strings in GROUP BY keys

Resolved

Remus Rusanu

55.

Fix serialization exceptions in VectorGroupByOperator

Resolved

Remus Rusanu

56.

Remove test code from ql\src\java tree, place it in ql\src\test tree

Resolved

Tony Murphy

57.

VectorGroupByOperator steals the non-vectorized children and crashes query if vectorization fails

Resolved

Jitendra Nath Pandey

58.

Vectorized reader support for timestamp in ORC.

Resolved

Sarvesh Sakalanaga

59.

Enable running all hive e2e tests under vectorization

Resolved

Tony Murphy

60.

VectorSelectOperator projections change the index of columns for subsequent operators.

Resolved

Jitendra Nath Pandey

61.

Cleanup column type dependencies in vectorization aggregate code

Open

Remus Rusanu

62.

Implement vector group by hash spill

Resolved

Remus Rusanu

63.

Support DISTINCT in vectorized aggregates

Open

Remus Rusanu

64.

Vectorized UDFs for Timestamp in nanoseconds

Resolved

Gopal V

65.

Vectorized aggregates do not emit proper rows in presence of GROUP BY

Resolved

Remus Rusanu

66.

Improve cache friendliness of VectorHashKeyWrapper

Open

Remus Rusanu

67.

Integrate Vectorized Substr into Vectorized QE

Resolved

Eric Hanson

68.

Fix VectorUDAFSum.txt to honor the expected vector column type

Resolved

Remus Rusanu

69.

CommonOrcInputFormat should be the default input format for Orc tables.

Resolved

Sarvesh Sakalanaga

70.

Implement vectorized RLIKE and REGEXP filter expressions

Resolved

Teddy Choi

71.

Unit test failure in TestColumnScalarOperationVectorExpressionEvaluation

Resolved

Jitendra Nath Pandey

72.

TestVectorGroupByOperator causes asserts in StandardStructObjectInspector.init

Resolved

Remus Rusanu

73.

VectorHashKeyWrapperBatch.java should be in vector package (instead of exec)

Resolved

Remus Rusanu

74.

Favor serde2.io Writable classes over hadoop.io ones

Resolved

Remus Rusanu

75.

Remove unused org.apache.hadoop.hive.ql.exec Writables

Open

Unassigned

76.

Vectorization not working with negative constants, hive doesn't fold constants.

Resolved

Jitendra Nath Pandey

77.

Implement vectorized text reader to read vectorized data from Text file

Patch Available

Sarvesh Sakalanaga

78.

Support Hive specific DISTRIBUTE BY clause in VectorGroupByOperator

Open

Remus Rusanu

79.

error at VectorExecMapper.close in group-by-agg query over ORC, vectorized

Resolved

Jitendra Nath Pandey

80.

Count(*) over tpch lineitem ORC results in Error: Java heap space

Resolved

Sarvesh Sakalanaga

81.

tpch query 1 fails with java.lang.ClassCastException

Resolved

Jitendra Nath Pandey

82.

wrong results for query with modulo (%) in WHERE clause filter

Resolved

Sarvesh Sakalanaga

83.

Use VectorExpressionWriter to write column vectors into Writables.

Resolved

Jitendra Nath Pandey

84.

Optimize COUNT(*) aggregate over vectorized ORC execution path

Open

Unassigned

85.

second clause of AND, OR filter not applied for vectorized execution

Resolved

Jitendra Nath Pandey

86.

second clause of OR filter not applied in vectorized query execution

Resolved

Jitendra Nath Pandey

87.

Fix ORC TimestampTreeReader.nextVector() to handle milli-nano math correctly

Resolved

Gopal V

88.

Query with filter constant on left of "=" and column expression on right does not vectorize

Resolved

Jitendra Nath Pandey

89.

query using LIKE does not vectorize

Resolved

Eric Hanson

90.

Max on float returning wrong results

Resolved

Remus Rusanu

91.

incorrect result for max aggregate over int column

Resolved

Remus Rusanu

92.

NPE in writing null values.

Resolved

Jitendra Nath Pandey

93.

Unit test failure in TestColumnColumnOperationVectorExpressionEvaluation

Resolved

Eric Hanson

94.

Fix ORC TestVectorizedORCReader testcase for Timestamps

Resolved

Gopal V

95.

Integrate basic UDFs for Timestamp

Resolved

Gopal V

96.

Optimize filter Column IN ( list-of-constants ) for vectorized execution

Resolved

Unassigned

97.

Unit test failure TestVectorSelectOperator

Resolved

Jitendra Nath Pandey

98.

TestCase FakeVectorRowBatchFromObjectIterables error

Resolved

Eric Hanson

99.

Query on Table with partition columns fail with AlreadyBeingCreatedException

Resolved

Sarvesh Sakalanaga

100.

Vectorized Sum of scalar subtract column returns negative result when positive expected

Resolved

Jitendra Nath Pandey

101.

Classcast exception with two group by keys of types string and tinyint.

Resolved

Remus Rusanu

102.

array out of bounds exception near VectorHashKeyWrapper.getBytes() with 2 column GROUP BY

Resolved

Remus Rusanu

103.

MIN on timestamp column gives incorrect result.

Resolved

Gopal V

104.

Optimize ORC StringTreeReader::nextVector to not create dictionary of strings for each call to nextVector

Resolved

Sarvesh Sakalanaga

105.

Float aggregate of single value loses precision

Open

Remus Rusanu

106.

Unary Minus Expression Throwing java.lang.NullPointerException

Resolved

Jitendra Nath Pandey

107.

java.lang.RuntimeException: Hive Runtime Error while closing operators: java.lang.ClassCastException: org.apache.hadoop.io.NullWritable cannot be cast to org.apache.hadoop.hive.serde2.io.DoubleWritable

Resolved

Jitendra Nath Pandey

108.

OrcInputFormat should be enhanced to provide vectorized input.

Resolved

Jitendra Nath Pandey

109.

NULLs and record separators broken with vectorization branch intermediate outputs

Resolved

Gopal V

110.

Vectorized ORC reader does not handle absence of column present stream correctly.

Resolved

Sarvesh Sakalanaga

111.

Null Pointer Exception in Group By Operator

Resolved

Jitendra Nath Pandey

112.

Hive Runtime Error while closing operators: java.lang.NullPointerException

Resolved

Remus Rusanu

113.

Incorrect aggregate results

Resolved

Remus Rusanu

114.

make vectorized LOWER(), UPPER(), LENGTH() work end-to-end; support expression input for vectorized LIKE

Resolved

Eric Hanson

115.

Unit e2e tests for vectorization

Resolved

Tony Murphy

116.

Implement vectorized type casting for all types

Resolved

Eric Hanson

117.

implement vectorized math functions

Resolved

Eric Hanson

118.

implement vectorized TRIM(), LTRIM(), RTRIM()

Resolved

Eric Hanson

119.

Make vectorization branch compile under JDK 7

Resolved

Ashutosh Chauhan

120.

Implement Vectorized Limit Operator

Resolved

Sarvesh Sakalanaga

121.

std, stddev and stddev_pop aggregates on double/float fail to vectorize

Resolved

Remus Rusanu

122.

Implement vectorized JOIN operators

Resolved

Remus Rusanu

123.

String column comparison classes should be renamed.

Resolved

Jitendra Nath Pandey

124.

ORC TimestampTreeReader.nextVector() off by a second when time is fractional

Resolved

Gopal V

125.

make vectorized math functions work end-to-end (update VectorizationContext.java)

Resolved

Eric Hanson

126.

Vectorized ORC reader does not set isRepeating flag correctly when 1’s are present in the input stream

Resolved

Sarvesh Sakalanaga

127.

create template for string scalar compared with string column

Resolved

Eric Hanson

128.

MAX/MIN aggregates yield incorrect results

Resolved

Remus Rusanu

129.

Make RLIKE/REGEXP run end-to-end by updating VectorizationContext

Resolved

Teddy Choi

130.

Allow prevention of string column re-use for string functions that can set results by reference

Open

Unassigned

131.

Vectorized plan generation should be added as an optimization transform.

Resolved

Jitendra Nath Pandey

132.

Create bridge for custom UDFs to operate in vectorized mode

Resolved

Eric Hanson

133.

Unit test failure in TestVectorTimestampExpressions

Resolved

Gopal V

134.

Consolidate and simplify vectorization code and test generation

Resolved

Tony Murphy

135.

Make vector expressions serializable.

Resolved

Jitendra Nath Pandey

136.

FilterExprOrExpr changes the order of the rows

Resolved

Jitendra Nath Pandey

137.

Vector operators should inherit from non-vector operators for code re-use.

Resolved

Jitendra Nath Pandey

138.

Enhance explain to indicate vectorized execution of operators.

Resolved

Jitendra Nath Pandey

139.

orc_create.q and other orc tests fail on the branch.

Resolved

Jitendra Nath Pandey

140.

The code generation should be part of the build process.

Resolved

Jitendra Nath Pandey

141.

Update hive-default.xml.template for vectorization flag; remove unused imports from MetaStoreUtils.java

Resolved

Jitendra Nath Pandey

142.

Commit vectorization test data, comment/rename vectorization tests.

Resolved

Tony Murphy

143.

Boolean constants in the query are not handled correctly.

Resolved

Jitendra Nath Pandey

144.

VectorizedRowBatch member variables are public.

Reopened

Jitendra Nath Pandey

145.

Follow convention for placing modifiers in variable declaration.

Open

Jitendra Nath Pandey

146.

Avoid catching Throwable and converting them to exceptions.

Open

Jitendra Nath Pandey

147.

Refactor VectorizationContext and handle NOT expression with nulls.

Resolved

Jitendra Nath Pandey

148.

Vectorization throws exception with nested UDF.

Resolved

Jitendra Nath Pandey

149.

TopN optimization in VectorReduceSink

Resolved

Sergey Shelukhin

150.

Implement end-to-end tests for vectorized string and math functions, and casts

Resolved

Eric Hanson

151.

Vectorized query failing for partitioned tables.

Resolved

Jitendra Nath Pandey

152.

Handle virtual columns and schema evolution in vector code path

Open

Matt McCline

153.

Implement vectorized year/month/day... etc. for string arguments

Resolved

Teddy Choi

154.

Implement BETWEEN filter in vectorized mode

Resolved

Eric Hanson

155.

Implement support for IN (list-of-constants) filter in vectorized mode

Resolved

Eric Hanson

156.

Write initial user documentation for vectorized query on Hive Wiki

Resolved

Eric Hanson

157.

Exception in vectorized map join.

Resolved

Jitendra Nath Pandey

158.

Implement vectorized SMB JOIN

Resolved

Remus Rusanu

159.

Fix validation of nested expressions.

Resolved

Jitendra Nath Pandey

160.

Exception in UDFs with large number of arguments.

Resolved

Jitendra Nath Pandey

161.

Vectorized Shuffle Join produces incorrect results

Resolved

Remus Rusanu

162.

Supported UDFs should have a separate annotation to indicate they are vectorizable.

Open

Jitendra Nath Pandey

163.

Validation doesn't catch SMBMapJoin

Resolved

Jitendra Nath Pandey

164.

Intermediate columns are incorrectly initialized for partitioned tables.

Resolved

Jitendra Nath Pandey

165.

Add unit test for vectorized BETWEEN for timestamp inputs

Resolved

Eric Hanson

166.

Implement support for BETWEEN in SELECT list

Patch Available

Navis

167.

Implement vectorization support for IF conditional expression for long, double, timestamp, boolean and string inputs

Resolved

Eric Hanson

168.

Implement vectorized support for CASE

Resolved

Eric Hanson

169.

Implement vectorized support for NOT IN filter

Resolved

Eric Hanson

170.

Implement vectorized support for COALESCE conditional expression

Resolved

Jitendra Nath Pandey

171.

Implement vectorized support for the DATE data type

Resolved

Teddy Choi

172.

Implement vectorized support for the DECIMAL data type

In Progress

Eric Hanson

173.

Implement vectorization support for IF conditional expression for boolean and timestamp inputs

Resolved

Eric Hanson

174.

Implement vectorization support for IF conditional expression for string inputs

Resolved

Eric Hanson

175.

query fails in vectorized mode on empty partitioned table

Open

Unassigned

176.

Implement vectorized support for IN as boolean-valued expression

Resolved

Eric Hanson

177.

Implement vectorized support for CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END

Resolved

Unassigned

178.

Rollups not supported in vector mode.

Resolved

Jitendra Nath Pandey

179.

Failure in cast to timestamps.

Resolved

Jitendra Nath Pandey

180.

Add vectorized reader for Parquet files

Closed

Remus Rusanu

181.

Contribute Decimal128 high-performance decimal(p, s) package from Microsoft to Hive

Resolved

Eric Hanson

182.

Create DecimalColumnVector and a representative VectorExpression for decimal

Resolved

Eric Hanson

183.

Implement vectorized decimal comparison filters

Resolved

Eric Hanson

184.

Support basic Decimal arithmetic in vector mode (+, -, *)

Resolved

Eric Hanson

185.

Implement vectorized decimal division and modulo

Resolved

Eric Hanson

186.

Implement vectorized reader for Date datatype for ORC format.

Resolved

Jitendra Nath Pandey

187.

Implement vectorized reader for DECIMAL datatype for ORC format.

Resolved

Jitendra Nath Pandey

188.

Implement vectorized type cast from/to decimal(p, s)

Resolved

Eric Hanson

189.

error in vectorized Column-Column comparison filter for repeating case

Resolved

Eric Hanson

190.

Make Vector Group By operator abandon grouping if too many distinct keys

Resolved

Remus Rusanu

191.

Implement fast vectorized InputFormat extension for text files

Open

Eric Hanson

192.

error in high-precision division for Decimal128

Resolved

Eric Hanson

193.

Add more unit tests for high-precision Decimal128 arithmetic

Resolved

Eric Hanson

194.

VectorExpressionWriter for date and decimal datatypes.

Resolved

Jitendra Nath Pandey

195.

Generate vectorized plan for decimal expressions.

Resolved

Jitendra Nath Pandey

196.

Add DECIMAL support to vectorized group by operator

Resolved

Remus Rusanu

197.

Add DECIMAL support to vectorized JOIN operators

Resolved

Remus Rusanu

198.

Column name map is broken

Resolved

Jitendra Nath Pandey

199.

Extend the alltypesorc test table to include DECIMAL columns

Open

Unassigned

200.

Implement vectorized unary minus for decimal

Resolved

Jitendra Nath Pandey

201.

bug in high-precision Decimal128 multiply

Resolved

Eric Hanson

202.

Vectorized mathematical functions for decimal type.

Resolved

Jitendra Nath Pandey

203.

fix bug in UnsignedInt128.multiplyArrays4And4To8 and revert temporary fix in Decimal128.multiplyDestructive

Open

Jitendra Nath Pandey

204.

Queries fail to Vectorize.

Resolved

Jitendra Nath Pandey

205.

Remove unnecessary white spaces in vectorization code

Patch Available

Teddy Choi

Activity

Jitendra Nath Pandey added a comment - 13/Mar/13 19:57

This will be incremental work in multiple phases, with no regression on the current system. We will publish a design/scope document very soon.
The main idea behind the proposal is to transform the execution engine to process a row batch at a time instead of a single row. The row batch will consist of column vectors and each operator will process the whole column vector at a time. The column vector will consist of array(s) of primitive types as far as possible.
The expressions will be implemented for various data types using pre-compiled templates. The appropriate expressions will be added to the operators based on data types.
A vectorized iterator interface will be implemented by the file formats to provide vectorized input to the operator tree.
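A rough Java sketch of the row batch and vectorized iterator ideas in this comment follows. The types LongColumn, RowBatch, and BatchReader are simplified stand-ins invented for illustration; they are not Hive's actual classes, and the toy reader covers a single long column held in memory rather than a real file format.

```java
/** Simplified stand-ins for illustration; not Hive's actual classes. */
final class BatchIteratorSketch {
    /** A column vector: one primitive array per column. */
    static final class LongColumn {
        final long[] vector;
        LongColumn(int capacity) { vector = new long[capacity]; }
    }

    /** A row batch: a set of column vectors plus the count of valid rows. */
    static final class RowBatch {
        final LongColumn[] cols;
        int size; // number of rows currently loaded
        RowBatch(int numCols, int capacity) {
            cols = new LongColumn[numCols];
            for (int c = 0; c < numCols; c++) {
                cols[c] = new LongColumn(capacity);
            }
        }
    }

    /** Hypothetical vectorized iterator: refills the caller's batch on each call. */
    interface BatchReader {
        boolean next(RowBatch batch); // returns false once the input is exhausted
    }

    /** Toy reader over an in-memory, single-column table. */
    static BatchReader overArray(long[] data) {
        return new BatchReader() {
            private int pos = 0;

            @Override
            public boolean next(RowBatch batch) {
                int capacity = batch.cols[0].vector.length;
                int n = Math.min(capacity, data.length - pos);
                if (n <= 0) {
                    batch.size = 0;
                    return false;
                }
                System.arraycopy(data, pos, batch.cols[0].vector, 0, n);
                batch.size = n;
                pos += n;
                return true;
            }
        };
    }

    /** An operator consumes whole batches, reusing the same batch object. */
    static long sum(BatchReader reader, RowBatch batch) {
        long total = 0;
        while (reader.next(batch)) {
            long[] v = batch.cols[0].vector;
            for (int i = 0; i < batch.size; i++) {
                total += v[i];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long[] data = new long[2500];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1;
        }
        System.out.println(sum(overArray(data), new RowBatch(1, 1024)));
    }
}
```

Reusing one batch object across calls to next() keeps allocation out of the scan loop; the last batch is simply shorter than the capacity, which is why the batch carries its own size.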

Jitendra Nath Pandey added a comment - 13/Mar/13 19:58

Reference on MonetDB: http://www-db.cs.wisc.edu/cidr/cidr2005/papers/P19.pdf

Eric Hanson added a comment - 14/Mar/13 00:41

This is part of the Stinger initiative. http://hortonworks.com/blog/100x-faster-hive/

Jitendra Nath Pandey added a comment - 18/Mar/13 23:40

The attached document covers the outline of the design. Any comments/feedback are welcome. We will keep updating the document with more details as we include more data types, operators and expressions. We will also include the vectorized iterator design into the document.

Eric Hanson added a comment - 06/Apr/13 01:00

Added section on requirements for implementation of vectorized iterator, with respect to how to load VectorizedRowBatch object on each call to next().

Steve Loughran added a comment - 10/Apr/13 12:23

We couldn't have a copy of the doc in PDF stuck up at the same time as the editable one could we?

Eric Hanson added a comment - 10/Apr/13 18:22

Fixed a bug in example, plus made minor wording changes in introduction.

Eric Hanson added a comment - 10/Apr/13 18:29

Adding pdf of design doc per request.

Eric Hanson added a comment - 10/Apr/13 18:29

updated version # and date

Eric Hanson added a comment - 09/May/13 22:31

Updated design document with discussion of precise handling and interpretation of all-non-null (noNulls) and all identical (isRepeating) column vectors.

Also included discussion of TIMESTAMP internal vector representation as a long integer number of nanoseconds since the epoch.
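To make the two flags concrete, here is a small Java sketch of how an expression might honor them. The flag names noNulls and isRepeating come from the comment above; the LongColumn class, the addScalar and get helpers, and the exact semantics shown (slot 0 stands for every row when isRepeating is set, and isNull is ignored when noNulls is set) are assumptions made for illustration, not a copy of the design document.

```java
/** Illustration of noNulls / isRepeating handling; semantics are assumed, not quoted. */
final class NullRepeatSketch {
    static final class LongColumn {
        final long[] vector;
        final boolean[] isNull;
        boolean noNulls = true;      // true: no entry is NULL, isNull[] need not be read
        boolean isRepeating = false; // true: slot 0 holds the value for every row
        LongColumn(int n) {
            vector = new long[n];
            isNull = new boolean[n];
        }
    }

    /** out = in + scalar over the first n rows, propagating NULLs and the repeating flag. */
    static void addScalar(LongColumn in, long scalar, LongColumn out, int n) {
        out.noNulls = in.noNulls;
        out.isRepeating = in.isRepeating;
        if (in.isRepeating) {
            // One computation stands for the whole batch.
            out.isNull[0] = !in.noNulls && in.isNull[0];
            if (!out.isNull[0]) {
                out.vector[0] = in.vector[0] + scalar;
            }
            return;
        }
        if (in.noNulls) {
            // Fast path: no null checks at all in the loop.
            for (int i = 0; i < n; i++) {
                out.vector[i] = in.vector[i] + scalar;
            }
        } else {
            for (int i = 0; i < n; i++) {
                out.isNull[i] = in.isNull[i];
                // Computed unconditionally; the value is ignored where isNull[i] is true.
                out.vector[i] = in.vector[i] + scalar;
            }
        }
    }

    /** Reads one logical row, returning null for SQL NULL. */
    static Long get(LongColumn c, int row) {
        int i = c.isRepeating ? 0 : row;
        if (!c.noNulls && c.isNull[i]) {
            return null;
        }
        return c.vector[i];
    }

    public static void main(String[] args) {
        LongColumn in = new LongColumn(1024);
        in.vector[0] = 5L;
        in.isRepeating = true;
        LongColumn out = new LongColumn(1024);
        addScalar(in, 2L, out, 1024);
        System.out.println(get(out, 700));
    }
}
```

In this sketch a repeating input costs one addition per batch instead of one per row, and a batch known to contain no NULLs skips the isNull array entirely.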

Eric Hanson added a comment - 10/May/13 00:28

The code for this work is currently in the "vectorization" branch of the public Hive repo.

Eric Hanson added a comment - 13/May/13 18:59

Added discussion of timestamp values before the epoch (in 1970) related to HIVE-4525.
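One pitfall with pre-epoch values, when a timestamp is stored as a long count of nanoseconds since the epoch, is that Java's / and % truncate toward zero. The sketch below shows the difference using Math.floorDiv and Math.floorMod. It is a general illustration under that assumed representation and is not taken from HIVE-4525 or the design document; the class and method names are hypothetical.

```java
import java.time.Instant;

/** Illustration only: splitting signed epoch nanoseconds into seconds and a nano fraction. */
final class EpochNanosSketch {
    static final long NANOS_PER_SECOND = 1_000_000_000L;

    /** Whole seconds since the epoch, rounded toward negative infinity. */
    static long wholeSeconds(long epochNanos) {
        return Math.floorDiv(epochNanos, NANOS_PER_SECOND);
    }

    /** Non-negative nanosecond fraction within that second. */
    static long nanosOfSecond(long epochNanos) {
        return Math.floorMod(epochNanos, NANOS_PER_SECOND);
    }

    public static void main(String[] args) {
        long t = -1_500_000_000L; // 1.5 seconds before the epoch

        // Truncating arithmetic yields -1 s and a negative fraction.
        System.out.println((t / NANOS_PER_SECOND) + " s, " + (t % NANOS_PER_SECOND) + " ns");

        // Floor arithmetic yields -2 s and a fraction in [0, 1e9).
        long sec = wholeSeconds(t);
        long nanos = nanosOfSecond(t);
        System.out.println(sec + " s, " + nanos + " ns");
        System.out.println(Instant.ofEpochSecond(sec, nanos));
    }
}
```

With floor arithmetic the fraction always lies in [0, 1e9), so seconds and fraction can be recombined the same way on both sides of 1970.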

Eric Hanson added a comment - 28/May/13 22:52

Updated design spec with new section by Remus Rusanu about vectorized group-by/aggregate. I edited it a little bit and added the final paragraph on future considerations.

Dmitriy V. Ryaboy added a comment - 03/Jul/13 23:15

Hi folks,
What an incredible amount of work! Looks fantastic, looking forward to this.

It seems like the general idea of a vectorized operator is not Hive-specific. Is there any possibility of abstracting the core logic of an operator that can efficiently process a stream of data, such as what you get from ORCFile, and return the computed results?

Having such a library be available independently of Hive would allow reuse in other Hadoop ecosystem projects (Pig, Cascading, Drill, etc) without the need to reinvent the wheel, and would also bring the whole community behind optimizing one set of operators instead of continuing the existing fragmented state of the world.

The process of separating out such a library might also yield benefits in terms of winding up with a cleaner design and better abstractions (that's been my experience when going through similar exercises on other projects – I don't have any reason to think your current design is not clean or doesn't have good abstractions).

Do you have any thoughts on how this could be achieved? Does this sound like something you would be interested in? Is there something that people currently working on other projects can do to help this become a reality?

Vinod Kumar Vavilapalli added a comment - 04/Jul/13 07:28

A huge +1 to that. Having a common set of operators will be a huge win. That said, I already see that the current branch follows Hive's operator base classes, uses HiveConf etc. I believe with little effort, this can be cleaned and pulled apart into one separate maven module that everyone can use.

Some points to think about:

The target location of the module. The dependency graph can become unwieldy.
Given the use of base Operator, OperatorDesc etc from Hive, if at all there is interest and commitment, we should do this ASAP when we only have a handful of operators.
Make one other project demonstrate how it can be reused across ecosystem projects, PIG will be great - just a few operators will be a great start

Thoughts?

Eric Hanson added a comment - 08/Jul/13 21:43

Dmitry and Vinod,

What specifically do you want to do with the code once it is factored out?

Eric

Dmitriy V. Ryaboy added a comment - 08/Jul/13 22:05

I would like to provide the same vectorization benefits to Pig and similar frameworks (possibly Cascading, and maybe the Spark or Crunch guys will want to use this as well, etc).

Jitendra Nath Pandey added a comment - 11/Jul/13 18:33

Dmitry, Vinod
There is a significant amount of vectorization work in expression evaluation, for example arithmetic expressions, logical expressions, aggregations, etc. Many of these expressions are pretty generic, and different systems are likely to have similar semantics for them. It should be possible to re-use this code with little change in Pig or other systems. Re-using these expressions will require the same vectorized representation of data in the processing engine, but that part of the code is also generic and re-usable. I think that could be a good starting point.
However, a good deal of the vectorization work is in operator code, where we have vectorized versions of the Hive operators. These operators are closely tied to Hive semantics and implementation. Therefore, generalizing them for re-use in other projects will need some restructuring in the Hive code base as well. Also, at this point we should be thinking more generally about a common physical layer shared between Pig and Hive. These languages can continue to have different logical plans, but it would be desirable for them to share a common physical plan structure because they both use the same map-reduce runtime.

Dmitriy V. Ryaboy added a comment - 11/Jul/13 20:30

Jitendra,
I believe physical plan primitives for both Hive and Pig (and potentially others) are going to come in via Tez, as both Pig and Hive want to get off strict MR in the long-term.

I'll take a crack at extracting what's extractable. Right now Hive's UDAF reaches fairly deeply into this code, as you noted, but I think with a little restructuring this can be factored out.

Eric Hanson added a comment - 17/Sep/13 23:16

Updated design specification with new section describing the vectorized UDF adaptor (HIVE-4961).

Jitendra Nath Pandey added a comment - 01/Oct/13 18:01

Vectorization work has been committed to trunk. Going forward, all the vectorization work will happen on trunk and the vectorization branch will be obsolete.

Lars Francke added a comment - 04/Oct/13 11:26

This is a huge patch and it's hard to see if it changes anything for the end user. As we'd like to keep the Wiki up-to-date it'd be great if someone could comment whether there are any configuration options besides hive.vectorized.execution.enabled or any other things that should be documented.

Thanks!

Eric Hanson added a comment - 04/Oct/13 16:19

I've been planning to write some user documentation for this feature. Where do you think would be a good spot in the wiki to include it?

Lefty Leverenz added a comment - 05/Oct/13 10:47

Put it in Design Docs (https://cwiki.apache.org/confluence/display/Hive/DesignDocs) until it's released. Later you can move it into the User Docs with a note about which release introduces it. You can either change the file's location in the hierarchy or leave it in place and just link to it from the User Docs section.

When it goes into User Docs, you have some choices. Does it belong on the Home page or in the Language Manual? If in the Language Manual, do you want it under DML or should it be a stand-alone doc? That depends on what you write and how you want readers to find the doc. You can always add links from other docs to make sure people find it.

Here's the Language Manual: https://cwiki.apache.org/confluence/display/Hive/LanguageManual.

Of course configuration goes here, perhaps in a subsection under Query Execution: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties. I suggest you make a section in your design doc that's formatted to match the configuration doc, so when the time comes you can just cut & paste.

Eric Hanson added a comment - 01/Nov/13 17:45

I put initial documentation at:
https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution

People

Assignee:

Jitendra Nath Pandey

Reporter:

Jitendra Nath Pandey

Votes:

2

Watchers:

53

Dates

Created:

13/Mar/13 19:56

Updated:

11/Aug/15 17:29

Time Tracking

Estimated:

168h

Remaining:

168h

Logged:

Not Specified
