Synapse Serverless SQL Vs. Lake Database

When designing a Lakehouse solution in Azure, the number of options are growing every day. Should you use Databricks or Synapse? Delta Tables or Parquet? How will the data be served? Dedicated Pools, Power BI, Serverless SQL? These questions can be overwhelming, especially since the differences between each option are so subtle. In this blog, I compare two of the most common features in a Synapse Lakehouse solution built on top of Delta tables: Serverless SQL Pools and Lake Databases. My goal is to help you navigate the available features, and even compare them to their counterparts offered by Databricks.

So what does Serverless mean?

When you hear the term ‘Serverless’ in Azure, this means that the compute (managed by Microsoft) is automatically scaled up/down based on traffic, and goes to sleep when not in use. Synapse comes with a ‘Built-In’ serverless pool that is completely free for the first 1TB of data queried, and only $5/TB after that. This is a very cost effective alternative to a Dedicated SQL Pool or Databricks SQL Warehouse.

In a typical Lakehouse solution, external tables or views can be built on top of underlying Delta tables in the datalake, decoupling the storage and compute of your databases. This means that you can do many of the things you could do in a SQL database, while keeping your physical data in the datalake. Serverless objects can be queried by analysts or reporting tools such as Power BI via an exposed SQL endpoint. In Azure Synapse, there are two types of databases that are served by the built-in endpoint, Lake Databases and Serverless SQL Pools.

Lake Database

Lake Databases contain external tables or views that sit directly on top of the Delta files in your datalake, and can be natively queried and managed using Synapse Spark notebooks. When you create a managed or external table using SparkSQL, the table is created in the Lake Database. These objects are synchronized to the Serverless SQL Endpoint, but access control is managed through the storage account and Synapse Workspace permissions.

Serverless SQL

Serverless SQL Pools contain similar external tables and views, but also provide SQL Functions and Stored Procedures. The SQL Pool is a logical representation of your datalake that offers many of the familiar benefits of traditional SQL databases such as T-SQL syntax and user/role based access.

Tale of the Tape

To borrow a boxing term, here are how each of the products stack up against each other in terms of managing, querying, and securing the data.

You can see that many of the features between the two types of databases overlap, but there are subtle differences. For instance, Lake & SQL databases can both be queried via the Synapse Serverless Endpoint and the underlying data can be manipulated in Spark. But if you wanted to drop or create a SQL table you must do so in a SQL Editor, while Lake tables can be managed using SparkSQL. The security features are also different; Access to Lake Database tables is inherited from the underlying files in the Storage Account, while access to Serverless SQL objects uses a more traditional SQL access model with roles and permissions.

What about Databricks?

I can write a whole blog about Synapse Vs. Databricks, but as things stand today it looks like “neither can live while the other survives⚡”. Although it does seem like Synapse is usually just a step behind Databricks when adding new capabilities.

This is also true when it comes to the Delta Lake. While the storage and scale capabilities of Azure are hard to beat, the Databricks implementation of the Delta Lake is a lot less confusing. Instead of two separate types of databases, Databricks has a centralized metastore that incorporates the best parts of Lake & Serverless databases. For instance, Delta tables automatically get added to the metastore when they are created in Databricks and table permissions or Row-level security can be managed in the Databricks workspace using an intuitive SQL syntax. This reduces the need to decide between Lake & SQL databases and also prevents the duplication of objects across the two databases.

The flexibility and availability of a true SQL Server endpoint makes Synapse Serverless a powerful Lakehouse solution for your Delta Lake. However the distinction between Lake & SQL databases is not very clear, especially since many of the features overlap between the offerings. In my opinion, Microsoft could benefit from combining the two into a single Lakehouse product that is integrated across the entire Synapse platform without having to make a choice for particular uses. I would not be surprised if this product sees a lot of enhancement over the next year to keep up with Databricks and fully leverage the power of Microsoft’s cloud infrastructure.

Previous
Previous

Databricks Streaming: Part 1

Next
Next

Designing the Data Model: Where to Start?