In this blog post, I show you how easy it is to start an Azure Synapse Analytics workspace (instance) and use its Serverless SQL Pool engine to analyze sample publicly available data. As you will read shortly, Azure Synapse Analytics provides many compute engines for different use cases. The easiest one to get started with is its Serverless SQL Pool since every Azure Synapse Analytics instance comes with one already created and ready to use. It also does not have any cost unless if you use it which makes it very attractive to those who have a limited Azure budget.
Here is a high-level definition for Azure Synapse Analytics from Microsoft Documents:
I have highlighted a few key points that I like to review:
Limitless: It means you can scale its services up easily. Scaling up will cost you more, but scaling up can happen within a few minutes and put a tremendous compute power within your reach; whereas on-prem, even when budget is not an issue, most of us are limited and slowed down by having to go through our company’s purchasing department and/or more importantly it takes time to buy/add resources. In Azure, scaling up is a mater to moving a sliding bar to a higehr level.
Brings together: Synapse is a collection of various products/capabilities under one interface. Some, like Azure Synapse Serverless SQL Pools and Azure Synapse Notebooks are brand new to Synapse. Some, like Azure Databricks, Azure SQL Datawarehouse, and Azure Data Factory, existed elsewhere in Azure before Synapse. Here is a summary of these capabilities:
|Technology/Service||Before Azure Synapse||With Azure Synapse|
|Data Integration (ETL)||Azure Data Factory/Azure Databricks||Azure Synapse Pipelines / Notebooks|
|Enterprise Data Warehousing||Azure SQL Datawarehouse||Azure Synapse Dedicated/Serverless SQL Pool|
|Big Data Analytics||Azure Databricks||Azure Synapse Spark Pool|
Unified Experience: Obviously, if you have an existing investment with any of the tools that predate Synapse, you will want to look into pros/cons of moving hose parts into Synapse; however, for new projects, Synapse makes it easy to start with one interface that provides all of these services to you: Ingest/Prepare/Transform/Explore/Manage/Serve. There may be some gaps in terms of features between a products counterpart outside of Synapse vs the one inside Synapse (Azure Data Factory and Azure Synapse Pipelines); however, the gaps have been closing over the past year and the remaining ones may not be relevant to most projects.
Now that we have a baseline definition, we will first go over how to provision an Azure Synapse Analytics instance followed by how to use it to query some sample data.
Note: My goal in this post is to set up a Synapse workspace for learning and exploring and as such I have followed most of the default settings. If you are setting up a workspace in a production environment, you must pay close attention to some of the network/security/permissions settings. The interface is good about giving you warnings and more resources to read.
Every resource in Azure must be a tied to a subscription which is how money spent is tracked. If you have an Azure subscription, the rest is easy. If you don’t, you can use this link to get a free account for a limited time.
Go to https://portal.azure.com/ and click on Create a resource:
In the Search bar, type Synapse, and choose Azure Synapse Analytics from the dropdown:
Click on Create:
This will take you to the Create Synapse workspace screen which has several tabs. We will go through each tab next and set the properties.
The first tab is the Basics tab. This is where you will give the Synapse Workspace (instance) a name and choose which Azure subscription and Resource Group it belongs to.
A subscription is what keeps track of your charges (money spent), this is how you pay Microsoft for your resources.
For Azure Subscription, chose the subscription you like to use. The following screenshots shows a test subscription I am using for this post, called “Main One”. You will see your own subscriptions in the dropdown.
Next you need to specify a resource group. A resource Group is a way of logically grouping Azure Resources. For instance, you can group resources of the same type (All Azure SQL DBs, all Azure Synapse instances…), or you can group resources that belong to the same business unit or use any other arbitrary logic to group them. Resource groups allow you to manage/monitor a collection of resources together.
For the Resource Group dropdown, you can either choose an existing one or click on the Create New link, shown under the Resource Group box. I created a new one called SynapseAnalyticsRG. In future, if I create more instances (workspaces), I will assign them to the same resource group to keep an eye on total cost.
Note: Typically, you don’t have to specify a name for the Managed resource group. Azure Synapse will create one for you behind the scenes for metadata storage. If you have organizational rules that enforce how all Azure resource names should be defined, you have the option of specifying the name. In this post I am skipping it.
Next you will choose a workspace name and Azure region. There are a couple of rules you have to follow for the name, as shown below:
Here I chose the name “synapserocks“! (You can probably tell I like Synapse :))
For the Region, choose the region that holds most of your other Azure resources. Typically, this is the one that is closest to your company’s geographical location or where you live if you are testing this for yourself.
Next, specify a storage account (disk space in the cloud) for the Synapse workspace to use. If you already have a storage account, you can choose it from the dropdown; otherwise, you can create a new on by clicking on Create new link under Account name dropdown.
Finally, choose a file system within the storage account. Think of a file system as a set of subfolders under the main storage account. If you have a file system that you have previously created in this storage account, you can choose if from the dropdown; otherwise, click on the Create new link under the File system name dropdown to create one.
Pay attention to the message in the blue box. You will need these instructions later, if you decide to give other users access to your Synapse workspace and the file system. For now, our goal is to start an Azure Synapse workspace and analyze some data ourselves.
Note: If working with others, later in Azure Synapse Studio, you should give them Contributor and more roles in Synapse and also give them Storage Blob Data Contributor in the storage account so that they can read/write the files.
Here some links that are useful to bookmark for future:
Click on the “Next: Security>” button at the bottom of the screen to go to the Security tab.
If this button is not working, that means you have forgotten to fill in one or more of the required fields.
Keep all the default settings and specify a SQL admin login. You can use this login to connect to the Serverless SQL Endpoint and do admin tasks such as create SQL Logins if needed. For the rest of the settings, continue with the defaults and click on Next: Networking.
For those of you who are curious about the section for System Assigned Managed Identity, here is a link to read more: Connect to a secure storage account from your Azure Synapse workspace – Azure Synapse Analytics | Microsoft Learn
Note that if you followed the basic default settings, this feature is disabled at this point, and you don’t have to worry about it for this learning/testing environment.
Under networking tab, choose the following settings.
If you don’t have a network person to help you understand the options, it is safer and easier to enable Managed Virtual Networks. Read more here: Managed virtual network – Azure Synapse Analytics | Microsoft Learn
Note: We are setting up this environment for testing and learning. If you are doing this for a production environment, make sure you read and understand all the links on this page.
Review + Create tab
Here you get a chance to review everything.
Note that cost says Serverless SQL est. cost/TB = $5. Serverless SQL Pool does not have a constant cost, if you don’t use it there is not cost. When you use it (run queries) it calculates the cost at $5 per 1 TB of data processed. Keep in mind that the storage account you specified has its own cost. Azure Storage Data Lake Gen2 Pricing | Microsoft Azure
Finally, when you are ready, click on …
This will take a few minutes. For me it took 5 minutes.
While the deployment is going on, you can look at the Template it created. You can use this template for automation later on to create another resource with the same settings.
Go back to the Overview tab to see the status of the deployment. Once the deployment is complete, you will see the following message.
From the above screen click on Go to resource group button. You should see your instance of Synapse listed along with the storage account.
Once you click on the resource name, it will take you to the following page which you should bookmark. This page gives you the link to Azure Synapse Studio for this Synapse instance and bunch of other useful information.
Also notice that your Synapse environment already has created a Serverless SQL Pool instance. It is called Built-in. Azure Synapse Serverless SQL Pool is a massively parallel processing/ distributed engine which is now at your fingertips! This is the compute engine you are going to use to analyze sample data.
Click on the workspace URL and bookmark it. This will take you to Azure Synapse Studio.
Using Azure Synapse Studio to analyze sample data
Azure Synapse Studio is the tool that lets you access all different engines/parts of Azure Synapse Analytics. It is organized into hubs on the left. First you need to get some sample data to use by going to Knowledge Center.
On the Home hub, in the bottom of the screen, click on Knowledge Center.
From Knowledge Center, choose Browse Gallery
In the Gallery, go to Datasets tab, type in NYC in the search bar to limit results to NYC Taxi datasets. Click on NYC Taxi & Limousine Commission – Yellow taxi trip records and click on Continue.
In the next screen, you will see a description of the dataset. Click on Add dataset button at the bottom of the screen.
This will take 1-2 minutes and will take you back to Azure Synapse Studio under the Data hub.
Data hub is where you can see all the data that this Synapse workspace can work with.
There are two tabs here:
- Workspace tab gives you access to SQL/Spark/Kusto databases which we are not getting into now.
- Linked tab gives you access to any storage accounts that you have access to. These can be local/organizational storage accounts listed under Azure Data Lake Storage Gen2 or other remote storage accounts.
In our case, the sample dataset we just loaded is listed under Linked tab under Azure Blob Storage. The dataset itself is not physically copied to this workspace. What you did, created a shortcut to a publicly accessible storage account from Microsoft with NYC Taxi data in it.
Now that you have this sample data shortcut, all you have to do to analyze it is to click on the ellipsis in from of it and choose New SQL Script -> Select Top 100 rows.
This will open a query screen and automatically choose the Built-in which is the only compute engine we have at the moment.
Here is a closer look at the auto getenated code:
-- This is auto-generated code SELECT TOP 100 * FROM OPENROWSET( BULK 'https://azureopendatastorage.blob.core.windows.net/nyctlc/yellow/puYear=*/puMonth=*/*.parquet', FORMAT = 'parquet' ) AS [result];
This is a typical format that you will see when working with Synapse Analytics SQL Pools. It uses an OPENROWSET command to access files in a storage account. You can read one file or a group of files as you can see in this example which it uses wild cards. The “/puYear=*/puMonth=*/*.parquet” means that you are asking for all files for all years/all months.
Note that the database name does not matter in this case either. We are running a query against a remote storage account. Click on Run to see the results.
If you look under the Message tab, you will see information about how much data was scanned, moved, written. Remember that total cost for a query in Synapse Serverless SQL Pool is calculated by the amount of data processed which is data read/moved/returned. It is $5 for a TB of data.
You can also see the cost per query from Monitor Hub under SQL Requests.
If you can limit the amount of data scanned to answer a query, you can save on costs. This is a topic for a different blog post but in short, it is best practice to only access/read what you need and pay attention to partitioning keys when available. Typically for large datasets you will find the data partitioned into folders by Year/Quarter/Month/Date or a combination of those. In the sample data we are working with PuYear and puMonth are the paritioning keys. If you can reference those columns in your query to limit data when applicable, you can save on costs. This is a more advance topic, and you don’t have to worry about it for the sample data we are working with here.
If you would like to read more on this topic, here is a link to a post where I talk about setting up incremental refresh from Power BI using Azure Synapse Analytics SQL Pools as a source where data is partitioned by Year.
Saving sample Query
From this point on, you can save this query for future use. To save the query in Azure Synapse Studio, click on Publish all button.
This will save the query under the Develop Hub.
Alternatively, you can create a Serverless SQL Pool database and save the code as a view and use it in other queries or reference it from other reporting tools.
In this post, you saw how to provision an Azure Synapse Analytics instance and start working with sample data from the Gallery. Once the sample dataset was created (as shortcuts, no data was actually copied), you were able to use the built-in Serverless SQL Pool engine to analyze it using SQL. This is all you need to do if you are using Azure Synapse Studio to analyze Parquet/Delta files or alternately you can create a database (other than master) in your Serverless SQL Pool and create views in it using similar code you saw.