Support for Synapse Spark workloads? #27

Open

ernestoongaro opened this issue Sep 13, 2024 · 2 comments

Comments

@ernestoongaro

Hello! Does this adapter work only with the Fabric Livy endpoints, or would it also work with Spark Livy endpoints?

@fdmsantos

Hi @ernestoongaro,

Did you have any success with this? I'm looking for the same thing; I would like to use dbt on Synapse Spark pools (not Fabric).

@sugibuchi

sugibuchi commented Nov 26, 2024

I recently prototyped a dbt adapter for the Synapse Spark pool. It is a fork of dbt-spark, but the extended part is written from scratch without reusing Cloudera's code. It currently works with OSS Apache Livy (convenient for local testing) and the Synapse Spark pool, and I don't think it would be challenging to adapt it to HDInsight and Fabric as well.

Unfortunately, I cannot share the source code for now, as this is an internal POC at our company. However, I can share some ideas based on this experience. My short answer is "not so easy".

The main challenge was the slowness of session creation in the Synapse Spark pool; as you have probably also experienced, it is really slow. On the other hand, dbt opens and closes a database connection for every task. Therefore, we had to find a way to (1) reuse the same Spark session across all tasks in a dbt run and (2) prevent zombie sessions that are not properly closed after the dbt run.

My current strategy is "don't create (if one exists), don't stop". When establishing a DB connection, my prototype lists the sessions available in the specified Spark pool and looks for a usable session with the name defined in profiles.yml. If such a session exists, the adapter reuses it; otherwise, it creates a new one with that name.
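For illustration only, here is a minimal sketch of that lookup against the plain OSS Livy REST API. This is not the prototype's actual code: the endpoint, session name, and helper name are assumptions, and Synapse's Livy endpoint uses a different base path plus Azure AD authentication.

```python
# Sketch of "don't create (if one exists), don't stop" against the OSS Livy REST API.
# LIVY_URL and SESSION_NAME are illustrative assumptions, not real configuration.
import time
import requests

LIVY_URL = "http://localhost:8998"    # assumed local Livy endpoint
SESSION_NAME = "dbt_synapse_spark"    # would come from profiles.yml


def get_or_create_session(livy_url: str, name: str) -> int:
    """Return the id of a reusable named session, creating one only if needed."""
    sessions = requests.get(f"{livy_url}/sessions").json().get("sessions", [])
    for s in sessions:
        # Reuse any live session with the expected name; skip dead/killed ones.
        if s.get("name") == name and s.get("state") in ("starting", "idle", "busy"):
            return s["id"]

    # No reusable session found: create one (this is the slow path on Synapse).
    resp = requests.post(f"{livy_url}/sessions", json={"name": name})
    resp.raise_for_status()
    session_id = resp.json()["id"]

    # Wait until the session is ready to accept statements.
    while True:
        state = requests.get(f"{livy_url}/sessions/{session_id}/state").json()["state"]
        if state == "idle":
            return session_id
        if state in ("error", "dead", "killed"):
            raise RuntimeError(f"Livy session {session_id} failed to start: {state}")
        time.sleep(5)
```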

My prototype intentionally does not stop sessions after a dbt run. The remaining session is either reused by the next run or stopped automatically by the idle timeout.

Another challenge is the latency of the Livy API. Synapse's Livy endpoint adds several seconds of latency between the execution of each Livy statement. Because of this latency, the current version takes more than 60 seconds (with 4 threads, not including session spin-up time) to build the Jaffle Shop example, compared to about 20 seconds with a local Spark + Livy setup.
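To show where that per-statement latency accumulates, here is a rough sketch of the submit-then-poll round trip each statement goes through, again assuming the OSS Livy API; the URL, helper name, and polling interval are illustrative, not taken from the prototype.

```python
# Each piece of SQL a dbt task runs becomes at least one Livy statement,
# and every statement pays this submit + poll round trip.
import time
import requests

LIVY_URL = "http://localhost:8998"  # assumed endpoint


def run_statement(session_id: int, sql: str, poll_interval: float = 1.0) -> dict:
    """Submit one SQL statement to an existing Livy session and wait for its output."""
    resp = requests.post(
        f"{LIVY_URL}/sessions/{session_id}/statements",
        json={"code": sql, "kind": "sql"},
    )
    resp.raise_for_status()
    statement_id = resp.json()["id"]

    # On Synapse each of these round trips adds seconds, which is why a dbt run
    # made of many small statements slows down so much.
    while True:
        st = requests.get(
            f"{LIVY_URL}/sessions/{session_id}/statements/{statement_id}"
        ).json()
        if st["state"] == "available":
            return st["output"]
        if st["state"] in ("error", "cancelled"):
            raise RuntimeError(f"Statement {statement_id} failed: {st['state']}")
        time.sleep(poll_interval)
```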

I do not know yet how to reduce this latency (multithreading helped, but not enough). I will test the prototype with bigger use cases and hopefully share more insights.
