Published in Level Up Coding · Resurrecting Scala in Spark: Another tool in your toolbox when Python and Pandas suffer · The Spark Dataset API is still useful for handling edge cases that require extra flexibility but still need to run fast. · Jan 3
Published in Level Up Coding · Stop using plain PySpark UDFs: No one likes slow cars! Part II · Quick insights about how Pandas UDFs work · Sep 7, 2024
Published in Level Up Coding · Stop using plain PySpark UDFs: No one likes slow cars! · How complex logic can still be implemented using out-of-the-box Spark functions with lightning-fast performance. · Jul 18, 2024
Published in Level Up Coding · Delta Lake Liquid Clustering — A visual explanation · How to optimize lakehouse data storage layout with minimal effort. · Jan 28, 2024
Published in Level Up Coding · Building a lakehouse on Google Cloud sans Databricks · Combining Delta Lake, Iceberg and BigLake · Dec 17, 2023
Published in Level Up Coding · Navigating the Void: Unraveling the Mysteries and Pitfalls of the ‘Void’ Data Type in Apache Spark · Exploring how a void column ends up in Spark DataFrames and what the implications could be. · Dec 5, 2023
Published in Level Up Coding · Setting up a PySpark local development environment for Dataproc Serverless · Smooth dependency management for local development and production jobs · Aug 29, 2023
Published in Level Up Coding · dbt tests vs Delta Live Tables expectations: a click bait to Spark observable metrics · Comparing the performance of dbt and DLT tests and correlating it with Spark observable metrics · Aug 5, 2023
Published in Level Up Coding · Delta Lake Universal Format — A First Look · Write as Delta Lake, read as Iceberg · Jul 15, 2023
Published in Level Up Coding · Back to basics: Spark caching key ideas! · Foundational concepts about how Spark caches DataFrames vs RDDs · Mar 13, 2023