Delta Lake Universal Format — A First Look

Yousry Mohamed · Published in Level Up Coding · Jul 15, 2023

Image generated by Bing Image Creator

We all have biases, and that applies to food, political views and even your preferred data lake storage format. The lakehouse has become a prominent pattern and its core idea is to have a table layer on top of cheap cloud storage. There are three major open source players in the field:

  • Delta Lake from Databricks
  • Apache Hudi from Uber
  • Apache Iceberg from Netflix

As usual, there will be an endless debate about which one is the best! In this post, we will have a first look at a new feature from delta lake that allows a delta lake table to be read as if it were an iceberg or hudi table. The new feature is called universal format (shortened as uniform); it is still in preview and supports iceberg only at the moment.

The idea is conceptually simple. All three formats are just a metadata layer on top of parquet files. So why not write as many metadata flavours as we want, so that the same table is readable as delta lake, iceberg or hudi?

This post explains the basic steps to write a delta lake table with universal format enabled for iceberg. Then BigQuery will be used to query the data as an iceberg external table.

Trying the new feature in a Databricks environment is much easier because most of the plumbing is done for us, plus there is integration with Unity Catalog. Nevertheless, this post will be completely Databricks-less. We will use the OSS delta lake flavour and run Spark on a local machine for simplicity.

Setup

This tutorial was prepared on a Mac but should work fine on Linux or WSL (no one likes to install Spark on Windows!).

You will also need a GCP account. A trial account would be enough.

Start with a fresh Python environment and install PySpark.

pip install pyspark==3.4.1

Create an empty folder somewhere, move to that folder and then download the jar required to allow Spark to write to Google Cloud Storage (GCS).

wget https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar

Follow the next steps inside an existing GCP project. Create a GCS bucket; in my case I named the bucket my-lakehouse, so please adjust the next steps to use your own bucket name.

Create a service account called my-service-account. Grab the email of that service account and head over to the bucket permissions section. Grant that service account the Storage Object Admin role, or any least-privilege role that allows reading and writing data in the bucket.

Image by author

Go to the service account details page, switch to the keys tab, add a new JSON key and download it to your local machine into the same folder as the jar from the previous step (just for simplicity).

Image by author

Test writing classic DataFrames to GCS

This has nothing to do with iceberg or the uniform format; it is just about preparing Spark and verifying that we can write to GCS.

export GOOGLE_APPLICATION_CREDENTIALS=<ABSOLUTE-PATH-OF-SERVICE-ACCOUNT-CREDENTIALS-FILE>

pyspark \
--packages io.delta:delta-spark_2.12:3.0.0rc1,io.delta:delta-iceberg_2.12:3.0.0rc1,org.apache.iceberg:iceberg-spark3-runtime:0.13.0 \
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
--jars ./gcs-connector-latest-hadoop2.jar \
--conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem

The export statement is required to make Spark authenticate with GCP using the service account identity. The next command starts PySpark with the following settings (a script-based equivalent is sketched after the list):

  • Wiring up packages for delta lake 3.0 RC1, another package to allow delta lake to write iceberg metadata and finally a Spark iceberg runtime to read iceberg tables using Spark for testing purposes.
  • Spark SQL extensions to configure delta lake integration within Spark.
  • GCS jar that is required to write Spark DataFrames to GCS.
  • And finally a Spark config to let Spark use GCS file system URIs.
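If you prefer a standalone Python script over the pyspark shell, the same session can be built programmatically. The sketch below is a rough equivalent under the same assumptions (credentials exported, connector jar in the current folder); the app name is arbitrary.

# A minimal sketch of an equivalent session built with the SparkSession builder.
# Assumes GOOGLE_APPLICATION_CREDENTIALS is exported and the GCS connector jar
# sits in the current folder, exactly as in the shell command above.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-uniform-demo")  # arbitrary name
    .config(
        "spark.jars.packages",
        "io.delta:delta-spark_2.12:3.0.0rc1,"
        "io.delta:delta-iceberg_2.12:3.0.0rc1,"
        "org.apache.iceberg:iceberg-spark3-runtime:0.13.0",
    )
    .config("spark.jars", "./gcs-connector-latest-hadoop2.jar")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .config(
        "spark.hadoop.fs.gs.impl",
        "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    )
    .getOrCreate()
)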

Now the Spark shell should be ready, and we can create a dummy DataFrame and write it to GCS.

df = spark.sql("SELECT id, concat('record ', id) as name, cast(rand() * 1000 as int) as value FROM RANGE(1000)")
df.show(5)
df.write.parquet("gs://my-lakehouse/draft-data")

If the above is successful, there will be a new folder called draft-data on GCS with a few parquet files.

Image by author
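As a quick sanity check from the same shell, the data can also be read back and counted; a count of 1000 means the round trip works.

# Read the parquet files back from GCS and count the rows (expected: 1000).
spark.read.parquet("gs://my-lakehouse/draft-data").count()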

Write a delta lake table with uniform enabled

Now that we have a Spark session capable of writing to GCS, we will write the same DataFrame as a delta lake table, but this time exploiting the new uniform format to add iceberg metadata as well.

Still in PySpark shell, run the below snippet.

spark.conf.set("spark.databricks.delta.write.dataFilesToSubdir", "true")

(
df
.write
.format("delta")
.option('delta.columnMapping.mode' , 'name')
.option('delta.universalFormat.enabledFormats', 'iceberg')
.save("gs://my-lakehouse/my-uniform-table")
)

The first statement sets a Spark config variable to write the delta lake table parquet files to a subfolder called data. Delta lake parquet files live in the table root directory by default. This change is required if the iceberg table is to be read by Google BigQuery; see this section of the docs for details.

The next statement writes the DataFrame as a delta table but adds the writer options to enable the universal format, in our case for iceberg. The config value delta.universalFormat.enabledFormats is supposed to hold a comma-separated list of table formats to enable. Currently only iceberg is supported, but hudi should come soon.
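To double-check that the property actually landed on the table, the table properties can be listed from the same shell. The snippet below is a small sketch using the path-based table reference; delta.universalFormat.enabledFormats should appear with the value iceberg.

# List the table properties of the freshly written table.
# delta.universalFormat.enabledFormats should show up as "iceberg".
spark.sql(
    "SHOW TBLPROPERTIES delta.`gs://my-lakehouse/my-uniform-table`"
).show(truncate=False)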

After running the snippet, the GCS table folder should have the classic _delta_log folder for delta lake tables and a metadata folder for iceberg.

Image by author

In most delta tables, parquet files live in the root table folder unless the table is partitioned, but there is no issue with keeping them somewhere else and adapting the file paths in the commit files. The screenshot below shows the content of _delta_log/00000000000000000000.json. The same idea is used by features like shallow table clones, where file paths point to another location as absolute rather than relative paths.

Image by author
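If you would rather not download the commit file, the same paths can be inspected with the Spark session we already have. The snippet below is just a sketch against the table path used throughout this post; each line of the commit file is a JSON action, and only the add actions carry data file paths.

# Peek at the data file paths recorded in the first delta commit.
commit = spark.read.json(
    "gs://my-lakehouse/my-uniform-table/_delta_log/00000000000000000000.json"
)
commit.where("add IS NOT NULL").select("add.path").show(truncate=False)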

Iceberg metadata is written to the metadata folder asynchronously, and there are some aspects to consider around concurrent writes, frequent delta lake commits and how those relate to iceberg metadata generation.

Image by author

Read the data as delta lake or iceberg with Spark

Before reading the table as an iceberg table from other tools, let's read it with Spark once using the delta lake format and once using the iceberg format. Run the below snippet in the PySpark shell.

(
spark
.read
.format("delta")
.load("gs://my-lakehouse/my-uniform-table")
.describe()
.show()
)

(
spark
.read
.format("iceberg")
.load("gs://my-lakehouse/my-uniform-table")
.describe()
.show()
)

The above snippet should produce the below output. There might be some Spark logging lines thrown here and there, which can be muted by running spark.sparkContext.setLogLevel("ERROR").

Image by author

It is clear now that the data written to the GCS folder is readable by Spark as both a delta lake table and an iceberg table. Next, let's see if the same data can be read by other applications that understand iceberg.

Read the data from BigQuery

There are two options to register the bucket data as an external table in BigQuery. The first is easy: it just requires creating an external table pointing to a specific commit file in the iceberg metadata folder. The second approach involves using a metastore, which is outside the scope of this post but is the recommended approach, as the first method requires recreating the external table with every new commit in the iceberg metadata folder.

Let’s use the first option. Inside a BigQuery editor, run the below SQL statement. Replace <your-project> with the name of your BigQuery project and also use your own bucket name. The table name is arbitrary here.

CREATE OR REPLACE EXTERNAL TABLE `<your-project>.default.people`
OPTIONS (
format = 'iceberg',
uris = ['gs://my-lakehouse/my-uniform-table/metadata/v1.metadata.json']
);

In another BigQuery editor, run a SELECT statement using the fully qualified name of the newly created external table. BigQuery will be able to read the data as if it were a normal iceberg table.

Image by author
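If you prefer to run the same check from Python instead of the BigQuery console, a small sketch using the google-cloud-bigquery client library (pip install google-cloud-bigquery, an extra dependency not used elsewhere in this post) would look like the below. Replace <your-project> as before.

# Query the external table from Python using the BigQuery client library.
from google.cloud import bigquery

client = bigquery.Client(project="<your-project>")
query = "SELECT * FROM `<your-project>.default.people` ORDER BY id LIMIT 5"
for row in client.query(query).result():
    print(row["id"], row["name"], row["value"])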

Back to our PySpark session, let's update the delta lake table and convert the name column values to uppercase.

spark.sql("UPDATE delta.`gs://my-lakehouse/my-uniform-table` SET name = upper(name)").show()

The cool thing with the Spark + delta lake integration is that we can treat the folder as a table name by prefixing it with delta. and then use SQL statements to manipulate the whole table. The result of spark.sql is a DataFrame, and in our case it shows some metrics of the DML operation.

Image by author
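The same path-as-table trick also works for inspecting the table history, which is handy here to confirm there are now two delta commits (0 and 1) before we move on.

# Show the delta commit history; there should be a WRITE and an UPDATE entry.
spark.sql(
    "DESCRIBE HISTORY delta.`gs://my-lakehouse/my-uniform-table`"
).select("version", "operation").show()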

If you run the same SELECT statement on BigQuery against the external table, you will see no change. This is because the external table is statically linked to the first commit (number 0 in delta lake and number 1 in iceberg). Just rerun the CREATE OR REPLACE EXTERNAL TABLE statement, but this time use v2.metadata.json.
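Rather than hard-coding the version number, one option is to list the metadata folder and pick the highest version before recreating the external table. The snippet below is a minimal sketch using the google-cloud-storage client library (an assumption on my side; any GCS listing tool would do) with the bucket and table names used in this post.

# Find the latest iceberg metadata file to point the external table at.
import re
from google.cloud import storage

client = storage.Client()
blobs = client.list_blobs("my-lakehouse", prefix="my-uniform-table/metadata/")

versions = []
for blob in blobs:
    match = re.search(r"v(\d+)\.metadata\.json$", blob.name)
    if match:
        versions.append((int(match.group(1)), blob.name))

latest_name = max(versions)[1]
print(f"gs://my-lakehouse/{latest_name}")  # plug this URI into the DDL above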

Now the SELECT statement on the external table should reflect the latest changes, and all values in the name column will be in uppercase.

Image by author

Wrap up

The universal format feature is still in preview and there are some limitations, but it is nonetheless very useful. There is a bit of plumbing needed, especially if you are not in a Databricks environment, but it is still very approachable elsewhere. The ultimate dream here is similar to the lakehouse mantra of having a single copy of data and a single source of truth. Now we evolve that into a single copy of data that is not bound to a specific storage format ecosystem.

Update Jan 2024

The follow-up post Building a lakehouse on Google Cloud sans Databricks uses the universal format along with the BigLake metastore for better integration.

Resources

https://docs.delta.io/3.0.0rc1/delta-uniform.html

Create Apache Iceberg BigLake tables | BigQuery | Google Cloud


Yousry is a lead consultant working for Cuusoo. He is very passionate about all things data including Big Data, Machine Learning and AI.