Benchmarking ORC improvements in Spark 2.3

Yousry Mohamed
8 min read · Jul 3, 2019
Photo credit: https://www.flickr.com/photos/bagogames/30778696885

I’ve always been a big fan of Duke Nukem and his motto of “Don’t have time to play with myself”. In the world of big data, time is really critical. It does not make sense to waste most of your day waiting for data processing jobs or data analysis queries to finish running. Any improvement in such cases translates directly into huge productivity gains.

Some big data installations depend heavily on ORC files, probably due to their dependency on Hive, for which ORC is the standard file format. Spark 2.3 announced major improvements in ORC performance, mainly by allowing the use of the latest ORC library, which includes a vectorized reader, instead of the legacy one that came with Hive a long time ago. Spark 2.3 supports two ORC reading modes. The legacy mode uses the old Hive dependency and is the default selection, to keep Spark backward compatible with existing applications. The other mode, called native, can be enabled with a config setting, and this is the mode supposed to provide better performance by switching to the vectorized reader. Spark 2.4.0 changed ORC to use the native mode by default. More details on the history and specific improvements can be found in this YouTube video.
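For reference, the switch between the two modes is a single Spark SQL setting. A minimal sketch of flipping it from a Spark 2.3 session (these are the standard Spark setting names):

// Spark 2.3 defaults to the legacy Hive-based reader ("hive"); opt in to the new one:
spark.conf.set("spark.sql.orc.impl", "native")
// The vectorized reader has its own toggle (on by default when impl is native):
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")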

The video included a few simple notebooks comparing read performance between the legacy ORC reader and the new native (vectorized) reader. The results were very promising, but I don’t trust results from a single experiment or statement as they are probably not statistically reliable. Spark’s source code itself has a bunch of benchmarking test classes demonstrating the performance superiority of the new native mode.

https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala

I wanted to give it a go myself, but I also preferred to use a standalone Spark benchmarking tool that I could reuse in other situations.

Say hi to Spark-Bench

Spark-Bench is “a Spark benchmarking and simulation tool”, as advertised on its dated website 😂. The tool has not received much love recently and has some rough edges, but it seemed good enough to give it a chance. There is also a quick video about the main features of the tool and how to use it.

Unfortunately Spark-Bench doesn’t work smoothly if you have a local Spark installation on Windows. The other options to try it are:

  • Spark on Linux
  • Spark on Linux in Windows Subsystem for Linux
  • Azure HDInsight cluster
  • Hortonworks Sandbox (local or cloud)
  • Other cloudy or dockerized options

I picked the second option, with a Spark 2.3 instance on Ubuntu on Windows Subsystem for Linux. Benchmarking on a single machine is totally different from benchmarking on a distributed cluster, especially for more sophisticated tests, but it should give indicative numbers for the case at hand. I recall from the video on the ORC improvements that some of the tests there were even done using a single-threaded executor.

I will not go through the process of Spark installation because it is really straightforward. Installing Spark-Bench is just as simple: download a release file and unzip it. Then a couple of environment variables have to be created: the first is the Spark home location and the second is the master URL.

export SPARK_HOME=/path/to/my/local/install/of/spark/
export SPARK_MASTER_HOST=local[*]

To minimize the amount of noise in console output, it’s better to edit Spark’s log4j config file and change the default logging level from INFO to WARN.
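With the stock Spark distribution, that is a one-line change after copying conf/log4j.properties.template to conf/log4j.properties:

# conf/log4j.properties — drop the root logger from INFO to WARN
log4j.rootCategory=WARN, console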

To make sure the tool is ready, run the provided minimal example, which executes a SparkPi program:

./bin/spark-bench.sh examples/minimal-example.conf 

Preparing test data

I won’t repeat the Spark-Bench documentation here, but the gist is to prepare a config file with details of the datasets to generate and the queries to run against them. The tool does all the plumbing of generating sample datasets, running queries against them and recording the time taken. My target is to check whether native ORC mode is faster than the legacy one, plus compare both to Parquet as a bonus.

Based on that, we need two datasets: one for Parquet and one for ORC. There is no need to create two ORC datasets, as the test pits the vectorized reader against the legacy reader. There are other optimizations around writing ORC files in native mode, but they are out of the scope of this article. First we need to create a file called parquet.conf in the Spark-Bench home directory and edit it as follows:
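The original embedded gist isn’t reproduced here, but a minimal sketch following the Spark-Bench HOCON layout would look like this (the /tmp output path and the 8G driver memory value are my assumptions):

spark-bench = {
  spark-submit-config = [{
    spark-args = {
      master = "local[*]"
      driver-memory = "8G"  // generous, since the 10M x 50 dataset is generated locally
    }
    workload-suites = [{
      descr = "Generate a k-means style dataset and write it as parquet"
      benchmark-output = "console"
      workloads = [{
        name = "data-generation-kmeans"
        rows = 10000000
        cols = 50
        output = "/tmp/kmeans-data.parquet"  // the extension selects the file format
      }]
    }]
  }]
}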

The workload name is set to “data-generation-kmeans”, a data generator provided by Spark-Bench to generate data for testing clustering algorithms. The important thing to notice here is setting the Spark driver memory to a decent amount, as we are running in local mode and will generate a dataset with 10 million rows and 50 columns. Also, the format of the generated dataset is inferred from the extension of the output path. For the ORC dataset we will create an orc.conf file as below:
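Rather than repeating the whole file, the sketch below shows the only line that changes (again assuming the /tmp path from above):

output = "/tmp/kmeans-data.orc"  // same generator config, ORC extension instead of parquet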

It’s pretty obvious that the only difference between both files is the output path extension.

Once both files are created and placed in the Spark-Bench home directory, we can execute them one by one, so the tool will use its k-means data generator to create those datasets and write them as requested.

./bin/spark-bench.sh ./parquet.conf
./bin/spark-bench.sh ./orc.conf

Perfect, we have two datasets and they are size-wise good enough for benchmarking purposes.

Running query workload

The next obvious stage is to run queries against both datasets and record the times to be analyzed or plotted later. The three use cases we have are:

  • Parquet
  • ORC with legacy reader
  • ORC with vectorized (native) reader

For every use case we are benchmarking, we will run 30 queries, which are effectively the same query with a bit of randomization to read different portions of the underlying dataset. Actually, I think this query needs to read all the records (columnar read 😂) for the columns in the WHERE clause, because the randn call produces a different value for each record.

select sum(c0) as sum, std(c22) as std from input where c0 < c22 * randn()

To compare Parquet to classic ORC, we need to create compare-parquet-to-orc.conf as follows:
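Again, a sketch in place of the embedded gist, reusing the dataset paths from the generation step. The result csv path is my assumption, and the repeat count of 30 mirrors the 30 query runs described above; the key names follow the Spark-Bench docs:

spark-bench = {
  spark-submit-config = [{
    spark-args = {
      master = "local[*]"
      driver-memory = "8G"
    }
    workload-suites = [{
      descr = "Run the same query against the parquet and legacy ORC datasets"
      repeat = 30  // 30 timed runs per input
      benchmark-output = "/tmp/results-parquet-vs-orc.csv"  // timing results land here
      workloads = [{
        name = "sql"
        input = ["/tmp/kmeans-data.parquet", "/tmp/kmeans-data.orc"]
        query = "select sum(c0) as sum, std(c22) as std from input where c0 < c22 * randn()"
        output = "/tmp/query-result.csv"  // forces Spark to actually execute the query
      }]
    }]
  }]
}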

The workload name is set to “sql” to instruct Spark-Bench to run queries against the input datasets. There are two inputs, pointing to the Parquet and ORC datasets. An output path is included to force the query to execute, as Spark doesn’t trigger execution of queries unless there is an action like counting records or writing the result to a destination. I noticed that the Spark-Bench documentation sometimes omits that property, but I think it’s very critical for the accuracy of the final results, even if we don’t do anything with the query results (we only need the timing results).

Speaking of final results, the benchmark-output property holds the path for the benchmarking result, which will be a csv recording the time taken for each run of querying the data plus saving the result of the query. The result of the query is a one-line aggregation, so the saving effort is negligible and the time taken is mainly for running the query against the relevant dataset.

To run this workload, execute:

./bin/spark-bench.sh ./compare-parquet-to-orc.conf

You will probably need to get a cup of coffee or do some planks and also make sure you have reliable laptop fans.

If the CPU doesn’t melt, there will be a csv file generated with some timing results we will analyze later.

The next workload file, orc-native-query.conf, has its contents as follows:
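A sketch of this one too; the key addition is the Spark conf entry switching the ORC implementation to native (the conf block placement follows the Spark-Bench docs, and the paths are my assumptions as before):

spark-bench = {
  spark-submit-config = [{
    spark-args = {
      master = "local[*]"
      driver-memory = "8G"
    }
    conf = {
      "spark.sql.orc.impl" = "native"  // switch to the new vectorized ORC reader
    }
    workload-suites = [{
      descr = "Run the same query against the ORC dataset with the native reader"
      repeat = 30
      benchmark-output = "/tmp/results-orc-native.csv"  // separate timing results file
      workloads = [{
        name = "sql"
        input = ["/tmp/kmeans-data.orc"]  // ORC dataset only this time
        query = "select sum(c0) as sum, std(c22) as std from input where c0 < c22 * randn()"
        output = "/tmp/query-result.csv"
      }]
    }]
  }]
}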

Other than pointing the input property to the ORC dataset ONLY, Spark is instructed via the spark.sql.orc.impl setting to use the new ORC vectorized reader. The timing results are also redirected to a different csv file.

Next execute the native ORC workload using:

./bin/spark-bench.sh ./orc-native-query.conf

Even before analyzing the results, I noticed that far less time was needed to complete the run and less memory was consumed, but feelings can be deceiving and numbers will tell the truth. To be fair, the second run was against a single dataset while the first was run against two datasets.

Climax time!

The two result files can be analyzed using Excel, Python, Zeppelin or whatever tool you like. I happen to have a Zeppelin installation available, so I will do it using Spark in Zeppelin, but first I had to copy the two csv files to an HDFS location accessible to the Zeppelin instance.

First we need to load the two files into data frames and register them as temp views.
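The original notebook paragraph isn’t reproduced here; a minimal Scala sketch of that step could look like the following (the csv paths and view names are my assumptions):

// Load the two Spark-Bench result files and expose them to SQL as temp views.
val classic = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("/results/results-parquet-vs-orc.csv")
val native = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("/results/results-orc-native.csv")
classic.createOrReplaceTempView("classic_runs") // parquet + legacy ORC timings
native.createOrReplaceTempView("native_runs")   // native ORC timings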

Next, a simple SQL query to union the two views, plus some tweaking to get the use case name and average the query time (which is recorded in nanoseconds by Spark-Bench), is all that’s needed to render a meaningful result.
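Something along these lines, assuming the Spark-Bench result csv exposes the input path and a total_runtime column in nanoseconds:

select case when input like '%parquet%' then 'Parquet' else 'ORC (legacy)' end as use_case,
       avg(total_runtime) / 1e9 as avg_seconds
from classic_runs
group by 1
union all
select 'ORC (native)' as use_case, avg(total_runtime) / 1e9 as avg_seconds
from native_runs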

Zeppelin also helps by providing the option to view the result as a chart.

The result speaks for itself. Before the vectorized ORC reader, Parquet was an obvious winner, with an average of 6.9 seconds versus 14 seconds for legacy ORC. Maybe that’s why Parquet was considered the de facto standard file format used with Spark. Now the vectorized ORC reader achieved an average of 4.77 seconds, which is even better than Parquet and around 3 times faster than the legacy reader.

In the YouTube video I shared about the ORC improvements, the presenter even mentioned that the future plan for Spark 2.4 and above is to make ORC the standard file format for Spark.

So, if you have Spark 2.3 or above and want to spend less time waiting for Spark jobs to complete, you’d better switch to the new ORC vectorized reader.

Happy mining!


Yousry Mohamed

Yousry is a lead consultant working for Cuusoo. He is very passionate about all things data, including Big Data, Machine Learning and AI.