Performance of Hive tables with Parquet & ORC

Source: http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy

ORC + Zlib has better performance than Parquet + Snappy

Datasets

Table A – Text File Format – 2.5 GB
Table B – ORC – 652 MB
Table C – ORC with Snappy – 802 MB
Table D – Parquet – 1.9 GB
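
To put the sizes above on a common scale, here is a minimal sketch that expresses each table as a percentage of the text table. It assumes 1 GB = 1024 MB, which the post does not state, and the names sizes_mb and relative_size are only illustrative.

```python
# Table sizes as reported above.
# Assumption: 1 GB = 1024 MB (the original post does not say which unit it uses).
MB_PER_GB = 1024

sizes_mb = {
    "Text": 2.5 * MB_PER_GB,
    "ORC": 652,
    "ORC with Snappy": 802,
    "Parquet": 1.9 * MB_PER_GB,
}


def relative_size(fmt, baseline="Text"):
    """Return the size of `fmt` as a percentage of the baseline table."""
    return 100.0 * sizes_mb[fmt] / sizes_mb[baseline]


if __name__ == "__main__":
    for fmt in sizes_mb:
        print(f"{fmt:16s} {relative_size(fmt):5.1f}% of text size")
```

Under that assumption, ORC comes out at roughly a quarter of the text size, ORC with Snappy at a little under a third, and Parquet at about three quarters.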

Parquet gave the worst compression for my table.
My tests on the tables above yielded the following results.

Row count operation
Text Format Cumulative CPU – 123.33 sec
Parquet Format Cumulative CPU – 204.92 sec
ORC Format Cumulative CPU – 119.99 sec
ORC with SNAPPY Cumulative CPU – 107.05 sec

Sum of a column operation
Text Format Cumulative CPU – 127.85 sec
Parquet Format Cumulative CPU – 255.2 sec
ORC Format Cumulative CPU – 120.48 sec
ORC with SNAPPY Cumulative CPU – 98.27 sec

Average of a column operation
Text Format Cumulative CPU – 128.79 sec
Parquet Format Cumulative CPU – 211.73 sec
ORC Format Cumulative CPU – 165.5 sec
ORC with SNAPPY Cumulative CPU – 135.45 sec

Selecting 4 columns from a given range using where clause
Text Format Cumulative CPU – 72.48 sec
Parquet Format Cumulative CPU – 136.4 sec
ORC Format Cumulative CPU – 96.63 sec
ORC with SNAPPY Cumulative CPU – 82.05 sec
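
The figures above are easier to compare once each is divided by the text-format baseline. The sketch below does only that arithmetic on the reported numbers. The names cpu_sec, ratio_to_text and fastest are illustrative, and the measurements themselves are taken as given.

```python
# Cumulative CPU seconds as reported above, keyed by operation and then format.
cpu_sec = {
    "row count": {
        "Text": 123.33, "Parquet": 204.92, "ORC": 119.99, "ORC with Snappy": 107.05,
    },
    "sum": {
        "Text": 127.85, "Parquet": 255.2, "ORC": 120.48, "ORC with Snappy": 98.27,
    },
    "average": {
        "Text": 128.79, "Parquet": 211.73, "ORC": 165.5, "ORC with Snappy": 135.45,
    },
    "select 4 columns": {
        "Text": 72.48, "Parquet": 136.4, "ORC": 96.63, "ORC with Snappy": 82.05,
    },
}


def ratio_to_text(operation):
    """CPU time of each non-text format divided by the text-format time."""
    times = cpu_sec[operation]
    return {fmt: t / times["Text"] for fmt, t in times.items() if fmt != "Text"}


def fastest(operation):
    """Name of the format with the lowest cumulative CPU time."""
    times = cpu_sec[operation]
    return min(times, key=times.get)


if __name__ == "__main__":
    for op in cpu_sec:
        ratios = ", ".join(f"{f}: {r:.2f}x" for f, r in ratio_to_text(op).items())
        print(f"{op}: fastest = {fastest(op)}; vs text: {ratios}")
```

Read this way, ORC with Snappy used the least CPU for the row count and sum, while the plain text table used the least for the average and the four-column select. Parquet used the most CPU in all four tests.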

Additional comments

Both of these formats have their own specific advantages. Parquet might be better if you have highly nested data, because it stores its elements as a tree, as Google Dremel does.
Apache ORC might be better if your file structure is flatter.

And as far as I know, Parquet does not support indexes yet. ORC comes with a lightweight index and, since Hive 0.14, an additional Bloom filter, which might be the reason for the better query speed, especially when it comes to sum operations.
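
The ORC index and Bloom filter mentioned above are switched on through table properties. As a rough sketch, the helper below builds a Hive CREATE TABLE statement as a string. The property names orc.compress, orc.create.index and orc.bloom.filter.columns are ORC table properties documented for Hive. The helper orc_table_ddl, the table name sales_orc and its columns are hypothetical and are not taken from the original tests.

```python
def orc_table_ddl(table, columns, compress="ZLIB", bloom_columns=()):
    """Build a Hive CREATE TABLE statement for an ORC table.

    `columns` is a list of (name, hive_type) pairs. `bloom_columns` lists the
    columns that should get a Bloom filter; leave it empty to skip that property.
    """
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    props = {"orc.compress": compress, "orc.create.index": "true"}
    if bloom_columns:
        props["orc.bloom.filter.columns"] = ",".join(bloom_columns)
    prop_sql = ",\n  ".join(f'"{key}"="{value}"' for key, value in props.items())
    return (
        f"CREATE TABLE {table} (\n  {cols}\n)\n"
        f"STORED AS ORC\n"
        f"TBLPROPERTIES (\n  {prop_sql}\n)"
    )


if __name__ == "__main__":
    # Hypothetical table, used only to show the shape of the generated DDL.
    print(orc_table_ddl(
        "sales_orc",
        [("id", "BIGINT"), ("amount", "DOUBLE")],
        compress="SNAPPY",
        bloom_columns=["id"],
    ))
```

Whether the Bloom filter actually helps depends on the query. It is mainly useful for equality predicates on the listed columns, so it should be tested against the real workload.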

The Parquet default compression is SNAPPY.

References:

  1. https://community.hortonworks.com/questions/2067/orc-vs-parquet-when-to-use-one-over-the-other.html
  2. http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/

 
