Support for Synapse Spark workloads? #27

Open

ernestoongaro opened this issue Sep 13, 2024 · 2 comments

Comments

@ernestoongaro

Hello! Does this adapter work only with the Fabric Livy endpoints, or would it also work with Spark Livy endpoints?

@fdmsantos

Hi @ernestoongaro,

Did you have any success with this? I'm looking for the same thing; I would like to use dbt on Synapse Spark pools (not Fabric).

@sugibuchi

sugibuchi commented Nov 26, 2024

I recently prototyped a dbt adapter for the Synapse Spark pool. It is a fork of dbt-spark, but the extended part is written from scratch without reusing Cloudera's code. It currently works with OSS Apache Livy (convenient for local testing) and the Synapse Spark pool, and I don't think it would be challenging to adapt it to HDInsight and Fabric as well.

Unfortunately, I cannot share the source code for now, as this is an internal POC at our company. However, I can share some ideas based on this experience. My short answer is "not so easy".

The main challenge was the slowness of session creation in the Synapse Spark pool; as you have probably also experienced, it is really slow. On the other hand, dbt opens and closes a database connection for every task. Therefore, we had to find a way to (1) reuse the same Spark session across all tasks in a dbt run and (2) prevent zombie sessions that are not properly closed after the dbt run.

My current strategy is "don't create (if one exists), don't stop". When establishing a DB connection, my prototype lists the sessions available in the specified Spark pool and looks for a usable session with the name defined in profiles.yml. If such a session exists, the adapter reuses it; otherwise, it creates a new one with that name.
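For illustration only, here is a minimal sketch of that lookup against the plain OSS Livy REST API. This is not the prototype's actual code: the endpoint, session name, and helper name are assumptions, and Synapse's Livy endpoint uses a different base path plus Azure AD authentication.

```python
# Sketch of "don't create (if one exists), don't stop" against the OSS Livy REST API.
# LIVY_URL and SESSION_NAME are illustrative assumptions, not real configuration.
import time
import requests

LIVY_URL = "http://localhost:8998"    # assumed local Livy endpoint
SESSION_NAME = "dbt_synapse_spark"    # would come from profiles.yml


def get_or_create_session(livy_url: str, name: str) -> int:
    """Return the id of a reusable named session, creating one only if needed."""
    sessions = requests.get(f"{livy_url}/sessions").json().get("sessions", [])
    for s in sessions:
        # Reuse any live session with the expected name; skip dead/killed ones.
        if s.get("name") == name and s.get("state") in ("starting", "idle", "busy"):
            return s["id"]

    # No reusable session found: create one (this is the slow path on Synapse).
    resp = requests.post(f"{livy_url}/sessions", json={"name": name})
    resp.raise_for_status()
    session_id = resp.json()["id"]

    # Wait until the session is ready to accept statements.
    while True:
        state = requests.get(f"{livy_url}/sessions/{session_id}/state").json()["state"]
        if state == "idle":
            return session_id
        if state in ("error", "dead", "killed"):
            raise RuntimeError(f"Livy session {session_id} failed to start: {state}")
        time.sleep(5)
```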

My prototype intentionally does not stop sessions after a dbt run. The remaining session is either reused by the next run or stopped automatically by the idle timeout.

Another challenge is the latency of the Livy API. Synapse's Livy endpoint adds several seconds of latency between the execution of each Livy statement. Because of this latency, the current version takes more than 60 seconds (with 4 threads, not including session spin-up time) to build the Jaffle Shop example, compared to about 20 seconds with a local Spark + Livy setup.
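To show where that per-statement latency accumulates, here is a rough sketch of the submit-then-poll round trip each statement goes through, again assuming the OSS Livy API; the URL, helper name, and polling interval are illustrative, not taken from the prototype.

```python
# Each piece of SQL a dbt task runs becomes at least one Livy statement,
# and every statement pays this submit + poll round trip.
import time
import requests

LIVY_URL = "http://localhost:8998"  # assumed endpoint


def run_statement(session_id: int, sql: str, poll_interval: float = 1.0) -> dict:
    """Submit one SQL statement to an existing Livy session and wait for its output."""
    resp = requests.post(
        f"{LIVY_URL}/sessions/{session_id}/statements",
        json={"code": sql, "kind": "sql"},
    )
    resp.raise_for_status()
    statement_id = resp.json()["id"]

    # On Synapse each of these round trips adds seconds, which is why a dbt run
    # made of many small statements slows down so much.
    while True:
        st = requests.get(
            f"{LIVY_URL}/sessions/{session_id}/statements/{statement_id}"
        ).json()
        if st["state"] == "available":
            return st["output"]
        if st["state"] in ("error", "cancelled"):
            raise RuntimeError(f"Statement {statement_id} failed: {st['state']}")
        time.sleep(poll_interval)
```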

I do not know yet how to reduce this latency (multithreading helped, but not enough). I will test the prototype with bigger use cases and hopefully share more insights.
