Jailbreak Oozie Spark action

Yousry Mohamed
5 min read · Aug 26, 2019


Running Spark Scala scripts in locked-down environments

Disclaimer

The method explained in this post worked on Spark 2.2.0. Unfortunately, when I tried it on 2.3.2, it failed. So if your Spark version is beyond 2.2.0, you can read for fun but don’t expect a working solution. Instead, you can use SparkILoop, similar to how Zeppelin runs Spark code. For more info, have a look here.

In some controlled Hadoop environments, shell access is not allowed. Thus the only supported way to run Spark jobs is either via Oozie workflows with a compiled jar or via Zeppelin notebooks. Zeppelin is generally fine, but there are concerns around how many compute resources are available and how long a job could be running.

It’s also not very productive to always require a compiled jar to run a Spark job in Oozie. PySpark is available but it’s generally less performant than Scala. What is really missing is the option to run a Scala script from a shell like:

spark-shell -i some-scala-script.scala

This is crucial for rapid experimentation, which is a clear productivity boost. Let’s see how we can use an Oozie workflow to run a Spark job sourced from a Scala script.

Investigating the problem

Oozie uses something very similar to spark-submit to submit jobs to Spark. At least from the Spark action UI perspective, we can see a reference to a jar file plus a fully qualified class name inside that jar.

Cool, the problem boils down to running spark-shell using spark-submit. But how do we do that? Let’s first use a Spark environment variable to see how the spark-shell command is expanded behind the scenes. The environment variable is SPARK_PRINT_LAUNCH_COMMAND and it prints out the underlying OS command used to launch the shell.

It’s a Java program executed with a bunch of arguments. The program is org.apache.spark.deploy.SparkSubmit and the class it is given to run is org.apache.spark.repl.Main.

Before trying this idea on an Oozie workflow, let’s prove the concept locally.

P.S. With some trial and error, I found that the argument -Dscala.usejavacp=true should be used to make things work, and it has to be passed as part of the Spark options.

So the following snippet should run a Scala script via spark-submit:

$env:SPARK_SUBMIT_OPTS = "-Dscala.usejavacp=true"
spark-submit --class org.apache.spark.repl.Main --name "My Cool Spark App" -i sample-job.scala

Let’s try things with a trivial Scala script.
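A minimal sketch along these lines is enough; the statements are illustrative, and the file is the sample-job.scala referenced in the command above:

// sample-job.scala — a trivial script to prove the concept (illustrative content)
println("Hello from a Scala script running inside spark-shell!")

// "spark" is the SparkSession the shell creates for us
println(s"Spark version: ${spark.version}")
spark.range(1, 6).show()

// Force the shell to exit once the script is done,
// otherwise it would sit waiting for interactive input
System.exit(0)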

And here is the result.

The last line of the Scala script is a System.exit(0) statement to force the Spark shell to exit after executing the requested script; otherwise the shell would sit waiting for user input.

This proof of concept looks promising, and in theory it should work in Oozie as well.

Oozie workflow XML definition

I will share the final workflow definition and point out the important bits; a skeleton of the definition is sketched after the list below.

  • Starting at line 3, HCat credentials have to be defined as they will be used by the Spark action. In most cases Spark will be initialised with Hive support, and those credentials are mandatory.
  • On line 27, the class used by the Spark action is defined as org.apache.spark.repl.Main.
  • On line 28, the absolute HDFS path of the spark-repl jar is provided. This is the jar that contains the REPL main class, and you can grab its location from the logs of any Spark job executed from Oozie on this cluster. Just search for spark-repl inside the logs and you will find it.
  • The usejavacp Spark option is listed on line 29.
  • Lines 30 & 31 contain the arguments needed to feed the script file to the Spark action.
  • Line 32 is critical. It serves two purposes: defining a workflow parameter representing the Scala script location, and aliasing that file with the name script.scala used on the previous line. This way the workflow definition is stored only once on HDFS and can be run with different script files located elsewhere on the file system.
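Putting the above together, a skeleton of such a workflow definition could look like the sketch below. Element values, paths and parameter names are illustrative assumptions, and the line numbers quoted in the list refer to the complete definition rather than this abridged sketch:

<workflow-app name="run-scala-script" xmlns="uri:oozie:workflow:0.5">
    <!-- HCat credentials so the Spark action can reach the Hive metastore -->
    <credentials>
        <credential name="hcat-cred" type="hcat">
            <property>
                <name>hcat.metastore.uri</name>
                <value>${hcatMetastoreUri}</value>
            </property>
            <property>
                <name>hcat.metastore.principal</name>
                <value>${hcatMetastorePrincipal}</value>
            </property>
        </credential>
    </credentials>
    <start to="spark-repl"/>
    <action name="spark-repl" cred="hcat-cred">
        <spark xmlns="uri:oozie:spark-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn</master>
            <mode>client</mode>
            <name>My Cool Spark App</name>
            <!-- The REPL main class instead of a class from a user-compiled jar -->
            <class>org.apache.spark.repl.Main</class>
            <!-- Absolute HDFS path of the spark-repl jar already on the cluster (illustrative path) -->
            <jar>${nameNode}/apps/spark/spark-repl_2.11-2.2.0.jar</jar>
            <!-- One way to pass the usejavacp flag to the driver JVM -->
            <spark-opts>--conf spark.driver.extraJavaOptions=-Dscala.usejavacp=true</spark-opts>
            <!-- Feed the aliased script file to the shell -->
            <arg>-i</arg>
            <arg>script.scala</arg>
            <!-- Ship the parameterised script and alias it as script.scala -->
            <file>${scriptPath}#script.scala</file>
        </spark>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Spark action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>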

Execution time!

Let’s try the above on a real cluster using another Scala script. BTW, we are not limited to small scripts with print statements. The script can do much heavier tasks and write to HDFS, as long as enough compute resources are allocated using Spark options in the Oozie workflow. So the script used in this test does a more practical job of aggregating some data and writing the result to a CSV file.
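The exact script doesn’t matter much; in spirit it is a minimal sketch like the following, where the table name, columns and output path are made-up placeholders:

// aggregate-job.scala — illustrative sketch of a heavier script
import org.apache.spark.sql.functions.sum

// Hypothetical Hive table, available because the shell starts with Hive support
val sales = spark.table("analytics.sales")

val dailyTotals = sales
  .groupBy("sale_date")
  .agg(sum("amount").as("total_amount"))

dailyTotals
  .coalesce(1)                            // single part file so the CSV is easy to inspect
  .write
  .option("header", "true")
  .mode("overwrite")
  .csv("hdfs:///tmp/daily-sales-csv")     // hypothetical output location

System.exit(0)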

Checking the YARN application list shows the following, which confirms there is a Spark job running.

After waiting a couple of minutes for the job to complete, the Spark history server shows it worked as expected, and inspecting the output location confirms the expected CSV file has been written.

Notes

  • Although the Spark job completes normally and the shell exits, YARN marks the Spark application status as FAILED and thus the Oozie workflow is marked KILLED as well. This could be a side effect of the workaround, or maybe we need to signal the shell exit in a different way.
  • We can import external libraries hosted on the cluster inside the Scala script. This can be done by adding --jars hdfs://foo/bar/jar-name.jar to the Spark options element of the workflow definition. For example, we may need to import the Google S2 spatial library as in the script below:
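The gist of such a script is sketched below; the S2 method names are recalled from the Java S2 library and, like the coordinates and output path, should be treated as assumptions:

// precision-check.scala — illustrative sketch, showing how far a change in the
// k-th decimal place of latitude moves a point, using the S2 jar added via --jars
import com.google.common.geometry.S2LatLng

val baseLat = -37.8136    // illustrative coordinates
val baseLng = 144.9631

val rows = (0 to 6).map { k =>
  val delta = math.pow(10, -k)
  val p1 = S2LatLng.fromDegrees(baseLat, baseLng)
  val p2 = S2LatLng.fromDegrees(baseLat + delta, baseLng)
  // getEarthDistance is assumed to return the distance in metres
  (k, delta, p1.getEarthDistance(p2))
}

import spark.implicits._
rows.toDF("decimal_places", "latitude_delta", "distance_metres")
  .coalesce(1)
  .write.option("header", "true").mode("overwrite")
  .csv("hdfs:///tmp/precision-check")     // hypothetical output path

System.exit(0)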

We would upload the required jar and add its HDFS path to the Spark config mentioned above. The script above shows the effect of decimal precision on the accuracy of a spatial location; for example, a difference of one whole degree of latitude maps to roughly 111 km. The result viewed in Excel looks like the following:

Conclusion

This hack may be a good starting point for rapid experimentation with Spark jobs in locked-down (enterprise-y 😉) Hadoop environments.


Written by Yousry Mohamed

Yousry is a principal data engineer working for Mantel Group. He is very passionate about all things data including Big Data, Machine Learning and AI.
