UNISYS FIRST COURSE (MICROSOFT AZURE SYNAPSE ANALYTICS)
Data Engineering on Microsoft Azure: Synapse Analytics
Azure Synapse Analytics is an analytics service that provides functionality for data integration, enterprise data warehousing, and big data analytics. Services provided include ingesting, exploring, preparing, managing, and serving data for BI and machine learning needs. In this course, you'll learn about the Azure Synapse Analytics platform and how it is used for data warehousing and big data analytics. Next, you'll learn how to create a Synapse Workspace, a dedicated SQL pool, and a serverless Apache Spark pool. You'll move on to explore how to analyze data using a dedicated SQL pool, Apache Spark for Azure Synapse, Serverless SQL Pools, and a Spark database, as well as how to analyze data that is in a storage account. You'll learn how to integrate pipelines using Synapse Studio, visualize data using a Power BI workspace, and monitor a Synapse Workspace. Finally, you'll learn about the Synapse Knowledge Center and the features of Azure Synapse Analytics and PolyBase. This course is one in a collection that prepares learners for the Microsoft Data Engineering on Microsoft Azure (DP-203) exam.
Table of Contents
Course Overview
Hi, I'm Bill Brooks. I've held various positions in the software field for over 20 years, including Computer Engineering Technology instructor, software developer, software architect, and team lead. I have extensive experience
[Video description begins] Your host for this session is Bill Brooks. He is a senior software developer. [Video description ends]
with the Microsoft development stack and as a cloud developer, primarily in the Microsoft Azure environment. I'm a strong proponent of the Agile approach to development and have been both a Product Owner and a Scrum Master. I hold a Bachelor's degree in mathematics from Concordia University in Edmonton and a Computer Engineering Technology diploma from NAIT. Azure Synapse Analytics is an analytics service that provides functionality for data integration, enterprise data warehousing, and big data analytics. Services provided include ingesting, exploring, preparing, managing, and serving data for BI and machine learning needs.
In this course, I'll discuss the Azure Synapse Analytics platform and how it's used for data warehousing and big data analytics. Next, I'll create a Synapse workspace, a dedicated SQL pool, and a serverless Apache Spark pool. I'll then move on to analyzing data using a dedicated SQL pool, Apache Spark for Azure Synapse, serverless SQL pools, and a Spark database, and to analyzing data that's in a storage account. Finally, I'll integrate pipelines using Synapse Studio, visualize data using a Power BI workspace, monitor a Synapse workspace, explore the Synapse Knowledge Center, and cover the features of Azure Synapse Analytics and PolyBase. This course is one in a collection that prepares learners for the Microsoft Data Engineering on Microsoft Azure (DP-203) exam.
Azure Synapse Analytics
In this video, I'm going to discuss Azure Synapse Analytics. Azure Synapse Analytics performs data integration by supporting a large range of input and output sources and making it simple to move data between these different technologies. It also acts as a data warehouse, storing data in SQL and Spark pools, and it brings tools together for big data analytics, such as Transact-SQL and Apache Spark. Synapse SQL extends standard T-SQL to include support for machine learning models, combines serverless and dedicated resources to best suit your predictable and unpredictable data flows, and has built-in functionality to handle data streaming. Synapse Analytics is a fully integrated environment that enables data ingestion from over 90 data sources. It contains an environment for code-free, drag-and-drop creation of extract, transform, load (ETL) data pipelines. This environment is very similar to the environment in Azure Data Factory.
This equivalent environment in Synapse Analytics means that you don't need to spin up a Data Factory instance to build ETL pipelines; you can simply do it right in Synapse Analytics. The environment lets you orchestrate logic through notebooks, Spark jobs, scripts, and several other methods. Synapse Analytics offers a single unified experience at an organizational level. All administration, from data management to exploring the data to visualization, can be done within the same Synapse Analytics UI. Azure role-based access control secures the environment, and this common approach is easy to manage. Synapse Analytics offers monitoring tools that span the Spark and SQL environments, and you can integrate Synapse Analytics with your CI/CD pipeline so you can manage your SQL and Spark code lifecycles in CI/CD repositories, along with the rest of your solution code. Apache Spark plays a big role in Synapse Analytics.
It's a very popular platform for data engineering since it has sophisticated and efficient means for manipulating data. It's often used to prepare incoming data, getting it ready for downstream analysis. It supports machine learning by integrating such tools as Spark ML and Azure ML, and it enables ETL data pipelines to get data from ingress to final analysis. Synapse Analytics makes using Apache Spark easier by handling the management load for clusters, so you can focus on your code, and you can also take advantage of the integration with .NET for Spark. Synapse Analytics serverless Spark pools allow you to run Apache Spark in two modes. Spark notebooks can be run on the fly by data scientists for data engineering; these support the C#, PySpark, Scala, and Spark SQL languages. Spark jobs can also be set up, which run your code on a schedule without human intervention, for automated processes such as data pipelines.
One of the nice aspects of Synapse Analytics is how it integrates SQL and Spark technologies under one consolidated experience. You can use both to seamlessly interact with your data lake by directly reading different files stored there, such as JSON, CSV, and Parquet, and loading data between different SQL and Spark databases is made efficient and scalable. There are numerous industries that can use Synapse Analytics; a few of them are the financial industry, such as banks, manufacturing, retail stores, and the healthcare industry. Data gets into Synapse Analytics via a linked service, which connects external resources to the workspace. There is no limit to how many linked services can be set up, and there are dozens of different source types that linked services can connect to. Once the data is in Synapse Analytics, it can be run through a pipeline, which is a data integration process consisting of several steps or activities, each of which performs a particular action, such as copying data or performing some transformation on the data set.
Activities work with the data via an integration dataset, which is a reference to the data of a linked service. Pipelines are executed via triggers, which can be scheduled based on time windows or fired via an event, such as blob data being written to a storage location. Synapse Analytics also contains a special engine called Data Flows, which lets you build data pipelines graphically without writing any code.
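To make the SQL side of this concrete, here's a minimal sketch of querying files directly in the data lake with the serverless SQL pool. The storage account, container, and folder path here are placeholder assumptions, so substitute your own:

SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://synapsedl2255.dfs.core.windows.net/files/sales/*.parquet', -- hypothetical path
    FORMAT = 'PARQUET'
) AS [rows];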
Integrating With Pipelines
In this video, I'm going to demonstrate how to create an Integrated Pipeline in Synapse Analytics. So I'm on my Azure Home screen and you're going to need a
[Video description begins] A Microsoft Azure homepage displays with a search bar on the top. The working pane constitutes 3 sections namely: Azure services, Recent resources, and Navigate. The Azure services section contains various options such as: Create a resource, Resource groups, Subscriptions, Azure Active Directory, and App Services. The Recent resources section includes a table with the following column headers: Name, Type, and Last viewed. It lists various entries including, synapsedl2255, synapse2255, and Demo. [Video description ends]
Synapse workspace for this. I already have one set up under Recent resources; mine's called synapse2255, and it comes with a storage account that's called synapsedl2255. So let me show you how you create your own Synapse workspace. If I go to the Create a resource button at the top left and click on that, it takes me to the Create a resource page
[Video description begins] A page opens with the heading Create a resource. It contains a search bar. The left pane includes various options such as: Get started, Compute, and Identity. The main pane displays a column with the header Popular. [Video description ends]
and in the Search box, I'm going to type synapse analytics and in the drop-down
[Video description begins] As he types synapse analytics in the search bar, a drop-down appears. It lists the following options: Datometry Hyper-Q for Azure Synapse Analytics, Azure Synapse Analytics (private link hubs), and Azure Synapse Analytics. [Video description ends]
and I'll choose Azure Synapse Analytics. On the Azure Synapse Analytics page,
[Video description begins] The Azure Synapse Analytics page opens. It contains a Create button. [Video description ends]
I'll click Create. Now I'm on the Create Synapse workspace page on the Basics tab, and under Project details I need to choose a Subscription.
[Video description begins] A page with the header Create Synapse workspace opens. The following breadcrumb displays above the page header: Home > Create a resource > Azure Synapse Analytics >. It includes the following tabs: Basics, Security, Networking, Tags, and Review + create. The Basics tab is active. [Video description ends]
It's already chosen the one I have, under Resource group I can create a new
[Video description begins] The Basics tab includes two segments named Project details and Workspace details. The Project Details segment contains two drop-down options labeled as: Subscription and Resource group. It also has a Create new button and a field named Managed resource group. [Video description ends]
resource group or choose one I already have, so I'm going to choose an existing one. You can leave Managed resource group empty. Under Workspace details,
[Video description begins] As he clicks on the drop-down, an option named it_cldema appears. He selects this option. [Video description ends]
[Video description begins] The Workspace details segment includes a field for name. It also contains drop-down options for Region, Account name, and File system name. There is a Create new button under the Account name and the File system name options. [Video description ends]
you need to give it a workspace name that's unique, so something like synapse6677, just a random name. For Region, choose your region. Under Select Data Lake Storage Gen2, fill in an Account name. You'll probably want to create a new one, so you can click Create new and give it a unique name. This will be the data lake, or storage account, that I showed you in my Recent resources; Synapse Analytics creates it to store its details. So click OK. File system name is the container that it will create inside that storage account, so click Create new, give it the name files, and then click OK. Now, on the bottom left, click Review + create. At the top, it will say Validation succeeded, and then at the bottom left you can click Create.
Now as I said, I've already created mine, so I'm just going to click Home at the top left. So now that you've created a Synapse workspace,
[Video description begins] The Microsoft Azure homepage opens again. [Video description ends]
you're going to have two resources: the Synapse workspace, which you can see under my Recent resources, and, as I mentioned, the storage account. So open up the storage account. When I click on my storage account,
[Video description begins] A new page titled synapsedl2255 displays. The left pane includes various options such as: Overview, Activity log, and Tags. [Video description ends]
I'm now on the Overview of my storage account. I'm going to scroll down and click Containers, and on your Containers page you'll see one container.
[Video description begins] As he scrolls down, four options appear in the boxes. They are as follows: Containers, File shares, Tables, and Queues. [Video description ends]
You should see a container called files and that's where Synapse Analytics stores its data.
[Video description begins] The synapsedl2255 | Containers section opens. The main pane comprises a table with the following column headers: Name, Last modified, Public access level, and Lease state. The table lists four containers, namely: files, stage1, stage2, and stage3. [Video description ends]
I've created 3 new containers, stage1, stage2, and stage3, and all three of these are empty right now. We're going to be creating a pipeline that copies files from stage1 to stage2 as the first step, and then copies the same files from stage2 to stage3 as the second step. So I'm going to click Home at the top left, and under my Recent resources, I'll click my Synapse workspace.
Now I'm on the Overview of my Synapse workspace
[Video description begins] A new page titled synapse2255 displays. The left pane includes various options such as: Overview, Activity log, and Tags. [Video description ends]
on the right-hand side, I'm going to scroll down, and in the Getting started section there's an Open Synapse Studio box. I'm going to click the Open link on that, and now I'm in Azure Synapse Analytics. [Video description begins] As he scrolls down, two options appear in the boxes. They are as follows: Open Synapse Studio and Read documentation. [Video description ends]
[Video description begins] A page titled Synapse Analytics workspace synapse2255 opens. The left pane includes various buttons. [Video description ends]
So on the left-hand side, the 4th button from the top is one that says Integrate when you hover over it. I'm going to click on that. Just to the
[Video description begins] A section titled Integrate displays. It includes a search bar and a plus sign. [Video description ends]
right, I get an Integrate section, and at the top right of that there's a plus sign. I'm going to click that and choose Pipeline.
[Video description begins] A drop-down with the following options appears: Pipeline, Copy Data tool, and Browse gallery. [Video description ends]
This creates a new pipeline for me; by default, it's called Pipeline 1.
[Video description begins] The left pane now displays the Pipelines section. The middle pane shows a list of Activities. Some of them are: Synapse, Move & transform, and Azure Data Explorer. The right pane includes the following buttons: Validate, Debug, and Add trigger. [Video description ends]
To the right of the Integrate section is an Activities list, and this hierarchy has a section called Move & transform. If I open that up, I can drag a Copy data block
[Video description begins] The Move & transform activity expands to show two options, namely: Copy data and Data flow. [Video description ends]
to the right-hand side, which is our pipeline area.
[Video description begins] The right pane now includes a box with the text Copy data1. There are various tabs under this box. Some of them are: General, Source, and Sink. The General tab contains fields for Name, Description, and Timeout. [Video description ends]
I'm going to drop it on there and below it, I get some options.
So first off, I'm on the General tab of the options below, and for Name, I'm going to call this copy stage1 to stage2,
[Video description begins] The text in the box now reads: copy stage1 to stage2. [Video description ends]
and then I'm going to click on the Source tab to the right of General. Below that, I get Source dataset, and I'm going to click the New button below the selector. A blade pops out to the right, called New integration dataset. In the Search box, I'm going to type blob, I get an option for Azure Blob Storage, and I'm going to click that and click Continue at the bottom. Now I get the Select format option. I'm going to click Binary, because we'll be copying the files, not actually reading them, and at the bottom I'll click Continue. Now I'm on Set properties. For the Name, I will just call it Stage1, and for Linked service, if I open that up, I'm going to click New. This creates a connection to our Azure Blob Storage.
[Video description begins] The blade now shows New linked service (Azure Blob Storage) screen. It contains fields for Name and Description and various drop-down options. [Video description ends]
So if I scroll down in the New linked service screen, just under Azure subscription is Storage account name.
I'm going to choose that and select my storage account, which in my case is synapsedl2255, and it'll be whatever you called it. Then at the bottom, I'll click Create. That successfully created my linked service; I'm now connected to blob storage.
[Video description begins] The blade shows the Set properties screen again. It includes the following fields: Name, Linked service, Connect via integration runtime, and File path. [Video description ends]
Now I have an option that says File path under Set properties, and I'm going to click the Browse button just to the right of that. I get a Browse blade, and I'm going to choose the stage1 folder, because that's the folder we're creating a dataset for, and I'm going to click OK at the bottom. Then on Set properties, I'll click OK at the bottom. That's now set up my Source, which is going to be the stage1 container of my blob storage. Now, back here on the pipeline, in the options at the bottom for my Copy data block, I'm going to choose Sink, and then at the bottom I choose a Sink dataset, and again I'm going to click New. I get the New integration dataset blade on the right. I'm going to choose Azure Blob Storage and click Continue at the bottom. For Select format, I'm going to choose Binary and click Continue at the bottom.
For Set properties, I will give it a Name of Stage2. For Linked service, I'm going to choose my AzureBlobStorage1; that's the same linked service that I chose before, because they're in the same blob storage. For File path, I'll click Browse to the right, and on the Browse blade, I'll choose stage2 and hit OK at the bottom. Then for Set properties, I'll click OK at the bottom. So I have now successfully set up a Copy data step: its source is the stage1 folder and its sink is the stage2 folder, so it will copy all the files from stage1 to stage2. Now I want a second Copy data block. Under Activities in the middle, I'm going to drag another Copy data block and drop it in the pipeline area, and I'm just going to minimize Activities and minimize Integrate so we can see this better. So now I have two Copy data steps
[Video description begins] The right pane now includes another box with the text Copy data1. [Video description ends]
I'm going to connect them. You'll notice on the 1st Copy data step there's a green rectangle on its right. If I grab that and drag it, it gives me an arrow, and I can connect it to the other Copy data block. This means that the first Copy data block will run, and when it's finished, the second Copy data block will run. So in my pipeline, I'm going to click the second Copy data block, and then in the options at the bottom
[Video description begins] The following tabs appear under the boxes: General, Source, Sink, Mapping, Settings, and User properties. [Video description ends]
under General, I'm going to give it the Name of copy stage2 to stage3. I'll click the Source tab
[Video description begins] The text in the second box now reads: copy stage2 to stage3. [Video description ends]
and I already have a stage2 dataset, so I don't need to create a new one; I'm going to open the pull-down menu and choose Stage2. So we're copying from Stage2. Then I'm going to click the Sink tab, and now I do need a new dataset, so I'll click New beside Sink dataset, and on the New integration dataset blade at the right, I'll click Azure Blob Storage and click Continue at the bottom. For Select format, I'll choose Binary and click Continue at the bottom.
For Set properties, the Name is going to be Stage3; for Linked service, I can choose my AzureBlobStorage1; and for File path, I'll click Browse at the right, and under Browse I'll choose stage3 and click OK at the bottom. Then for Set properties, I'll click OK at the bottom. So I've now set up a two-stage copy. Now what we want to do is run this, but the first thing we need to do is publish it. At the very top of the screen, on the main menu, there's a Publish all button. I'll click that, I get the Publish all blade on the right, and I'll click Publish at the bottom of that. So that's successful. Now what we want to do is create a trigger, which is going to run our pipeline. Just above the pipeline window, the 4th button from the left is Add trigger, and I'm going to choose New/Edit. I get an Add triggers blade on the right, and for Choose trigger, I'm going to open the pull-down and click New.
That gives me a new blade called New trigger. I'll just leave the Name as Trigger 1. The Type is Schedule, the Start date defaults to now, and the Time zone is UTC; that's fine. Under Recurrence, I'm going to set it to every one minute, and at the bottom I'm going to click OK, and then under New trigger I'm going to click OK again. Then, at the very top of the screen on the main menu, I'm going to click Publish all, and on the Publish all blade on the right, I'm going to click Publish at the bottom. So that's publishing our trigger, and that trigger is going to run our pipeline every minute. That means that every minute it should take whatever is in stage1, copy it to stage2, and then from stage2 copy it to stage3. Now we can see whether our trigger is running by going to the left menu and going to Monitor, which is the 2nd button from the bottom, and on the right-hand side
[Video description begins] The right pane now shows a section called Pipeline runs. It includes a table. [Video description ends]
we have a Pipeline runs table, and at the top of it we see our Pipeline 1, and it is actually running: the Status says In progress, so it's not complete yet. If I hit Refresh at the top, now I see it succeeded. So I'm going to go back to Azure, and let's see what happened.
If I click the Home at the top left and I go to my Storage account under
[Video description begins] The Microsoft Azure homepage displays again. [Video description ends]
Recent resources and then on the Overview I go down to Containers,
[Video description begins] The synapsedl2255 page opens. [Video description ends]
and then under Containers, I click stage1. Now remember we haven't put any
[Video description begins] A new page titled stage1 Container opens. The right pane displays a table with the following column headers: Name, Modified, Access tier, Blob type, Size, and Lease state. Currently, this table doesn't include any data. There are various buttons at the top of the table such as: Upload, Add Directory, and Delete. [Video description ends]
files in stage1 yet, so that trigger would have successfully copied nothing. So what we'll do is upload something. I'll click Upload at the top; to the right, I get an Upload blob blade, and I'll click the browse button beside Files. I'm just going to choose any file that I have. I have a C# solution file here; it doesn't really matter what file it is. I'll click Open, and then on the blade I'll click Upload at the bottom, and it's uploaded that file. So now at the top left of the
[Video description begins] The table now displays an entry named EventHubWriter.sln. [Video description ends]
screen I'm going to click my Storage account name just to take me back to the Containers and under Containers on the right,
[Video description begins] The synapsedl2255 | Containers section opens again. [Video description ends]
I'm going to click stage2. Notice it's empty. If I hit Refresh a few times, the results show up: now in stage2, we have EventHubWriter.sln. If I click the name of my storage account at the top left and go to stage3 on the right, I see it's also copied there. So we have success.
Visualizing Data Using Power BI
In this video, I'm going to demonstrate how to integrate Synapse Analytics with Power BI to create visual reports. For this demo, you'll need a Power BI subscription that's linked to your Azure tenant, and you also need the Power BI Desktop application installed on your local machine. So I'm starting on the Power BI website, app.powerbi.com, and the first thing I'm going to do is create a workspace.
[Video description begins] The window titled "Power BI" appears. The left pane contains the following options: Home, Favorites, Recent, Create, Workspaces, and so on. [Video description ends]
So on the left menu at the very bottom, I'm going to click Workspaces and a blade pops up that says My workspace at the top and at the bottom, there's a yellow button called Create a workspace. I'm going to click that.
I get a blade to the right that's called Create a workspace and it's asking for a Workspace name. So I'm going to call it demo-workspace, and I'll click Save at the bottom. So that's created our demo-workspace. Now what I'm going to do is use Power BI desktop to load some data into that workspace. So here's Power BI,
[Video description begins] The "Power BI Desktop" opens. The working pane is divided into three sections. On the left it shows three basic view icons: Report view, Data View, and Relationship view. The middle section is canvas area that is currently empty. The right section displays three panes: Filters, Visualizations and Fields. [Video description ends]
and the first thing that you see when you open Power BI Desktop is this Add data to your report, and there's a box to the right that says Try a sample dataset. That's what we want to do, so I'm going to click that. I get a window that says Two ways to use sample data. I click the Load sample data button at the bottom right, and now I get a Navigator window with a folder on the left that says Financial Sample.
Underneath that are two tables. One is called financials. So I'm going to check the box beside financials,
[Video description begins] The right section displays a table with the following column headers: Segment, Country, Product, and so on. [Video description ends]
and then I'm going to click Load at the bottom right. So what it's doing is loading that data and creating an empty report, and we're going to upload that to Power BI. So it's finished. On the main menu at the very top, on the far right, is a Publish button. I'll click that,
[Video description begins] A pop-up box titled, Microsoft Power BI Desktop appears. [Video description ends]
and I get a window that says Do you want to save your changes? and I'll click Save. Now it asks me to save it to a local location.
I'm going to give it a name of demo-data, and I'll click Save at the bottom right. Now it asks me to Publish to Power BI and to Select a destination. I see My workspaces, and the one we just created is called demo-workspace; I'll choose that and then click Select at the bottom right. So it's now publishing that to the web, and it's successful. That's all we need Power BI Desktop for. Now that we have that sample dataset, I'm going to close Power BI Desktop, and I'm back in the website. You can see I'm still on demo-workspace,
[Video description begins] The content pane contains three tabs: All, Content, and Datasets + dataflows. Currently, the All tab is selected. This tab displays a table with the following column headers: Name, Type, Owner, and so on. [Video description ends]
but now there are two rows in the table on this page, there's a Report and there's a Dataset.
As I mentioned, it creates the dataset and then it creates an empty Report. I'm going to click the ellipsis just to the left of the Type column for the demo-data report,
[Video description begins] A drop-down menu appears. It has the following options: Analyze in Excel, Delete, Settings, and so on. [Video description ends]
and I'm going to choose Delete because we don't want this report for this demo.
[Video description begins] A pop-up box titled, Delete report appears. [Video description ends]
It asks me, Are you sure you want to permanently delete demo-data?, and I click Delete. Now I'm left with the dataset, and this is what we want, since when we go to Synapse Analytics it will want to load a dataset. So let's go to Synapse Analytics. I have my workspace open, and on the left menu, at the very bottom,
[Video description begins] The "Microsoft Azure | Synapse Analytics" appears. The navigation pane contains the following icons: Home, Data, Develop, Integrate, Monitor, and Manage. Currently, the Home icon is selected. The content pane displays the following cards: Ingest, Explore and analyze, Visualize, and Learn. [Video description ends]
I'm going to click on Manage,
[Video description begins] The content pane is divided into two sections. The left pane contains the following options: Analytics pools, External connections, Integration, Security, and so on. Currently, the Linked services option is selected. The content pane displays a table with the following column headers: Name, Type, Related, and Annotations. [Video description ends]
and in the options on the left, under External connections, choose Linked services if it's not already chosen. On the right-hand side, you'll see a table of your linked services; at the top of that, click New.
Now, the blade on the right is called New linked service. At the very top, there's a big yellow banner with a button that says Connect to Power BI, so I'm going to click that, and now the blade says New linked service (Power BI). It asks for a name; I'm just going to let it default. It asks for a Tenant; that's the Azure tenant that your Power BI is associated with. For Workspace name, if you pull it down and you've created the demo-workspace, you should see it here. So I'm going to select that, and then at the bottom left I'm going to click Create. So now I have a connection to my Power BI dataset, and I can create reports with it. On the left-hand side, I'm going to click the third button from the top, which is called Develop, and in the Develop section just to the right, there's a plus sign at the top right of that section.
[Video description begins] A drop-down menu appears. It has the following options: SQL script, Notebook, Data flow, Power BI report. and so on. [Video description ends]
I'm going to click that, and I'm going to choose Power BI report. On the right-hand side, I have a blade that says New Power BI report and it's asking me to Select a Power BI dataset to create a report. So, this is why we had to upload the dataset. I'm going to choose demo-data and click Create at the bottom, and now we have a screen that shows a full report tool and we can create a full report on this.
[Video description begins] The "Power BI reports" opens. [Video description ends]
So right now I'm just going to minimize the Develop section for a moment so that we have more space. On the left-hand side, I have a report window and on the right-hand side, I have Filters, Visualizations, and Fields. So under Visualizations, I'm going to click the button that's on the top row, second column. It's a Stacked column chart. Click that, and as soon as I click it, I get a column chart on my report on the left.
So I'm going to make it a little bigger, and now I can choose data for it. On the very right-hand side, under Fields, I'm going to click on Profit, and when I do, that adds the Profit field to the Values in the visualization. I'm going to show the Profit by Country, so in Fields I'm also going to click Country, and it will automatically add Country to the axis. So now on the left, we have a chart on our report which shows the Profit by Country. I'm going to save this by clicking on File in the top left and clicking Save as. I get a window that says Save your report, Enter a name for your report. So I'm going to call this demo-report and click Save.
The Power BI report is now saved successfully in Synapse Analytics. So I'm just going to close the demo-report tab at the very top, and just to show where that is, I'm now going to expand my resources pane on the left. Under Develop, you can now see there's a whole Power BI section: I can see my workspace, the Power BI datasets, and my demo-report. Now I'm going to switch back to the Power BI website, and if I refresh the website, I now see my report under demo-workspace, and if I click it, I can see the contents. So in conclusion, Synapse Analytics can connect to Power BI and act as a fully-featured report designer, which synchronizes its reports with Power BI in the cloud.
Monitoring a Synapse Workspace
[Video description begins] The "Microsoft Azure | Synapse Analytics" appears. The navigation pane contains the following icons: Home, Data, Develop, Integrate, Monitor, and Manage. Currently, the Home icon is selected. The content pane displays the following cards: Ingest, Explore and analyze, Visualize, and Learn. Beneath these cards it displays a section titled, Recent resources. This section contains a table with the two column headers: Name and Last opened by you. [Video description ends]
In this video, I'm going to demonstrate how to monitor pipeline runs, trigger runs, Apache Spark applications, and SQL requests, all in Synapse Analytics. So, I'm in my Synapse Analytics workspace, and on the left menu I'm going to go to the fifth button from the top, called Monitor, and click on that. This is the central place where you can monitor
[Video description begins] The content pane is divided into two sections. The left section constitutes the following options: Integration, Activities, and Analytics pools. [Video description ends]
everything in Synapse Analytics. We're going to start from the top under Integration just to the right of the menu, and the first option is Pipeline runs, so let's click on that. These are all old Pipeline runs, so let's create a
[Video description begins] The "Pipeline runs" page opens. The content pane displays a table with the following column headers: Pipeline name, Run start, Run end, and so on. [Video description ends]
new pipeline. So, back to the menu on the left, I'm going to go to the Integrate button, which is fourth from the top. Just to the right of that, I get an Integrate section, and there's a plus button at the top of the Integrate section. I'll click on that and choose Pipeline. So, it
[Video description begins] A drop-down with the following options appears: Pipeline, Copy Data tool, and Browse gallery. [Video description ends]
[Video description begins] The left pane now displays the Pipelines section. The working pane is divided into three sections. The left section shows a list of Activities, namely: Synapse, Move & transform, and Azure Data Explorer. The middle section includes the following buttons: Validate, Debug, and Add trigger. The right section constitutes the heading titled, Properties. It has two tabs: General and Related. Currently, the General tab is selected. It has the following option: Name, Description, Concurrency, and Annotations. [Video description ends]
creates a pipeline for me, called Pipeline 1. I'm going to rename it to demo-pipeline so we can recognize it. Now, under the Activities list to the right of the Integrate section, the first option is Synapse. I'm going to open that, and I'm going to drag a Notebook onto my pipeline area on the right.
[Video description begins] As he drags the Notebook, three new tabs appear, namely: General, Settings, and User properties. Currently, the General tab is selected. This tab has the following fields: Name, Description, Timeout, and so on. [Video description ends]
So, in the properties below, the name is Notebook1; I'm going to change that to demo-notebook. I'm going to click on the Settings tab, and it's asking for a Notebook, which we don't have yet. So, I'm going to click the New button to the right of Notebook. Now it takes me to a Notebook area, and I'm going to create a very simple notebook with one line, which is just going to be print("hello world"). I'll choose a Language
[Video description begins] Line 1 reads as: print("hello world"). [Video description ends]
on the menu at the top to be PySpark, and there's a section that says Attach to in the main menu and it tells you to select an Apache Spark pool. Now we don't have an Apache Spark pool yet, so we're going to create one.
The dropdown says Manage pools, so I'll click on that, and it takes me to an Apache Spark pool screen. At the top of the screen is a New button, so I'll click the New button, and I get a blade on the right called Create Apache Spark pool; I'm on the Basics tab. So, under Apache Spark pool
[Video description begins] The blade constitutes four tabs: Basics, Additional settings, Tags, and Review + create. Currently, the Basics tab is selected. [Video description ends]
details, I have to give it a name, so we'll call it demosparkpool. Under Node size, I'm going to choose the smallest, 4 vCores. Autoscale is going to be Disabled, and I'm going to minimize the Number of nodes to 3; then, at the bottom, we'll click Review + create. At the top, in green, it says Validation succeeded, so at the bottom we'll click Create. While that's deploying the demosparkpool, I'm going to go to the menu on the left and click on Integrate, and I see that the Spark pool is deployed. Now, on the right, I'm looking at my notebook again, so I need to choose an Apache Spark pool, and we just created one. If I open the Select Apache Spark pool pull-down in the main menu, I can choose demosparkpool. I notice that it's not started, but when I click the Run cell button to the left of my code, it says Connecting and then Starting just above my code, and now it's running. OK, so that's completed. Now, I actually paused the video, because it took a couple of minutes for the Spark pool to start up, but you see that it's printed out hello world, so we know that our notebook works. In the Properties on the right-hand side, I'm going to rename it to demo-notebook. Now, on the tabs at the very top, I'm going to go back to demo-pipeline, click on my Notebook box, and in the options below, click the Settings tab; notice I've chosen demo-notebook. So now we've set up a pipeline, and we should be able to run it. But before we run it, we need to publish it. So, at the very top menu, I'm going to click the Publish all button, and I get the Publish all blade on the right
[Video description begins] The blade displays a table with the following column headers: NAME, CHANGE, and EXISTING. [Video description ends]
and I'm going to click the Publish button at the bottom. So, just to review, we've created a pipeline with one step that runs a notebook, and we also had to create the Spark pool that runs the notebook. That's now all set up. Now, going back to the menu in the pipeline area, I'm going to click Add trigger and hit Trigger now. I get a Pipeline run blade, and I'm going to click OK at the bottom. So now it tells me it's running the demo-pipeline. We're running this so I can show you how you monitor that run. So, in the left menu, second button from the bottom, I'm going to click Monitor.
[Video description begins] The content pane is divided into two sections. The left section constitutes the following options: Integration, Activities, and Analytics pools. Currently, the Pipeline runs option is selected. The right section displays a table with the following column headers: Pipeline name, Run start, Status, and so on. [Video description ends]
And now, in the Integration section to the right of the menu, I'm on Pipeline runs, and to the right of that is the table of pipeline runs. We're currently looking at the triggered ones: at the top of the screen, there are tabs that say Triggered and Debug, and I have it on Triggered, so we're looking at the triggered pipeline runs. The first one is demo-pipeline, and you see it's in progress. If I click on it, I can see the one step, and
[Video description begins] The subsequent page opens. It displays a table with the following column headers: Activity name, Activity type, Run start, and so on. [Video description ends]
the fact that it has a blue circle at the top right tells me that it's in progress. So, I'm just going to pause this for a second and let it complete. OK, that took just a little over 3 minutes, but it has completed successfully. Now, if I go down to the bottom, in Activity runs, I can click the Activity name, demo-notebook, and it takes me to the details of this particular run. Here you can do a little bit of troubleshooting. You see at the bottom of the screen there's a Logs section with two tabs, Diagnostics and Logs. Under Logs, I can choose to look at the standard error, which gives me details about the run, and I can also use that pull-down to choose standard out, where we should see our hello world. And here it is, because print sends to standard out, so we can look at logs here. You can also download logs; you see there's a button on the right-hand side, and you can filter errors and warnings as well. So those are some ways in which you can monitor and troubleshoot pipeline runs. Now let's take a look at the next section. If I go to the left, under Integration, the next option is Trigger runs, so I'll click on Trigger runs. I have some old triggers
[Video description begins] The content pane constitutes the following tabs: All, Schedule, Tumbling window, and so on. Currently, the All tab is selected. Within this tab it displays a table with the following column headers: Trigger name, Trigger type, Trigger time, and so on. [Video description ends]
in here, but you can see at the top of the Trigger runs screen on the right that there are several tabs: you can look at all the trigger runs; you can look at only Schedule trigger runs, which all the ones I've run are; or you can look at Tumbling window, Storage event, and Custom event trigger runs. These are the different trigger types. I'm going to go back to All again. So, let's create and run a trigger and see it appear on this screen.
So, I'm going to click the Integrate button on the far-left menu. Now we can see our demo-pipeline, and if I go to the pipeline area on the right-hand side, the menu at the top has Add trigger. So, I'll click on that
[Video description begins] A drop-down menu appears. It has two options: Trigger now and New/Edit. [Video description ends]
and I'll click New/Edit. I get an Add triggers blade, and it's got a pull-down that says Choose trigger. If I pull that down, I'll choose New, and I get a New trigger blade. I'm going to call it demo-trigger. I'm going to leave the Type as Schedule, the Start date is now, the Time zone is UTC, and the Recurrence will be Every 1 minute, and I'll click OK at the bottom. Under New trigger, I'll click OK at the bottom again. Now I need to publish it, so I'll go up to the main menu and click Publish all; I get the Publish all blade, and I'll click Publish at the bottom. So that's successfully published, and it should be running now. Let's go back to Monitor: back to the left menu, and click Monitor. And lo and behold, on the right-hand side in the Trigger runs list, I see the very first row is demo-trigger, with Status Succeeded. The next thing to look at, in the list on the left below Trigger runs, under Activities, is Apache Spark applications. Now you will see
[Video description begins] The "Apache Spark applications" page opens. It displays a table with the following column headers: Application name, Submitter, Submit time, Status, and so on. [Video description ends]
the list on the right-hand side, the very first row of the table has a Status of Running and the Pool name is demosparkpool. That's the Spark pool we set up for our Notebook that runs with our pipeline, so it's already listing it as an activity. If I click its name on the left-hand side under Application name, I will get details on it and I can see Logs at the bottom and Diagnostics
[Video description begins] The content pane displays various buttons: Cancel, Refresh, and Spark UI. [Video description ends]
and I can download the logs, just like I could with the particular runs of my pipeline. I can also click the Spark UI button on the main menu for additional information about the Spark cluster. Now, going back to Activities on the left, the final thing we're going to look at is SQL requests. If I click on SQL requests, this will show every SQL request that's run. To show that, let's start up a SQL pool, and I'll do a query and show you how it appears here. So, on the left menu, I'm going to click the very bottom button, Manage, and just to the right of the menu, under Analytics pools, I'm going to choose SQL pools. On the SQL pools list
[Video description begins] The content pane displays a table with the following column headers: Name, Type, Status, and Size. [Video description ends]
on the right, I'm going to click New at the top. I get a blade that says Create dedicated SQL pool, and under Dedicated SQL pool details on the Basics
[Video description begins] The blade constitutes various tabs: Basics, Additional settings, Tags, and Review + create. [Video description ends]
tab, I'm going to fill in a Dedicated SQL pool name; I'm just going to call it demosqlpool. I'm going to turn the Performance level down to the minimum and click Review + create at the bottom, and then on the Create dedicated SQL pool blade, I'll click Create at the bottom. So now that's deploying.
I'm just going to pause until it's finished deploying. OK, so the status of our demosqlpool in the table is now Online. Now I'm going to go to the menu on the far left and up to the second button from the top, called Data, and click on that. In the Data section just to the right, there are two tabs, Workspace and Linked; I'm on the Workspace tab. Under Databases in that section, I'm going to click the ellipsis on the right-hand side and click Refresh, and now there's one database. If I open up Databases, there's our demosqlpool, so I'm going to click the ellipsis beside the SQL pool, click New SQL script, and choose Empty script. I'm going to write a script that fails on purpose: select * from sometable, when I know that sometable doesn't exist. So, I'm going to click Run in the main menu, and as I expected, in the Messages at the bottom it says Invalid object name sometable. Now let's go back to our Monitor section. Back in the menu on the left, I'm going to click Monitor, and I'm on SQL requests. One thing I didn't notice before is that the Pool filter is set to the Built-in pool. So, in the main menu of SQL requests at the top, I'll click Pool: Built-in, and instead of Built-in, I'll choose our demosqlpool, and we'll see the results. I actually did have historical results from other SQL pools, but the one at the very top is the one that we created. If I click More, the Request content blade pops up on the right, and I can see my select * from sometable, so I'll click Close at the bottom of that. If I scroll the table, you can see its Status is Failed. From here, you can take a look at what queries have been run against SQL pools and their status, and you can get some more information. So, in conclusion, the various Spark and SQL functional components of Synapse Analytics can be monitored in one central location, where you have access to logs, what queries have been run, and even more in-depth information.
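As a complement to the Monitor hub, you can also query request history directly from the dedicated SQL pool itself. Here's a minimal sketch, assuming the standard sys.dm_pdw_exec_requests dynamic management view that dedicated SQL pools expose; run it against demosqlpool rather than the built-in pool:

SELECT TOP 20
    request_id,
    status,              -- e.g. Completed, Failed, Running
    submit_time,
    total_elapsed_time,  -- in milliseconds
    command              -- the T-SQL text, such as our failing select * from sometable
FROM sys.dm_pdw_exec_requests
ORDER BY submit_time DESC;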
Exploring the Synapse Knowledge Center
In this video, I'm going to demonstrate the Azure Synapse Analytics Knowledge Center, which contains many datasets and samples to help you get going in Synapse Analytics. So, I'm on my Home screen in Azure and for this demo you're going to need a Synapse workspace. I have one under Recent resources, so I'm going to click on that. And on my Overview on the right-hand side,
[Video description begins] A page titled, synapse2255 appears on the screen. The left navigation pane contains options, namely: Overview, Activity log, Tags, Security, and so on. Currently, the Overview tab is selected. The content pane displays the corresponding details. [Video description ends]
I'm going to scroll down and under Getting started I'm going to click Open under Open Synapse Studio. So here I am, in the workspace and right on the main home page on the right-hand side is a box that says Learn. So, I'm going to
[Video description begins] The "Microsoft Azure | Synapse Analytics" appears. The navigation pane contains the following icons: Home, Data, Develop, Integrate, Monitor, and Manage. Currently, the Home icon is selected. The content pane displays the following cards: Ingest, Explore and analyze, Visualize, and Learn. [Video description ends]
click on that, and it takes you to the Knowledge center. So, I'm on the Knowledge center screen. At the bottom are three boxes. The first says Use samples immediately, and these are fully self-contained samples. We're going
to take a look at one of those in a minute. The box beside it is Browse gallery, which has a lot of different datasets and sample code that you can look at. And finally, the right-hand box says Tour Synapse Studio; if you click on this, a wizard will step you through the different features of Synapse Studio. I'm going to start off with Use samples immediately, so I'm going to click that box on the left, and I get a blade on the right called Use samples immediately. There are several samples here.
[Video description begins] The blade displays the following options: Explore sample data with Spark, Query data with SQL, Create external table with SQL, and so on. [Video description ends]
I'm going to click the one that says Query data with SQL, and I'll click Use sample at the bottom left. Immediately, a code window pops up that has my SQL in it. You see on the left it says Develop, and under SQL scripts it says Query data with SQL, so it's generated this for me automatically. I'm going to click Publish all in the main menu, and I get a blade on the
[Video description begins] The blade displays a table with the following column headers: NAME, CHANGE, and Existing. [Video description ends]
right that says Publish all, and I'm going to click Publish at the bottom; so that's complete. Now, if I scroll down in the query, you'll see several queries are in here. Line 28 says SELECT TOP 100 * FROM, line 29 is an OPENROWSET, and line 30 is a URL to some blob storage. So, this is going to
[Video description begins] Line 29 reads as: OPENROWSET(. [Video description ends]
[Video description begins] Line 30 reads as: BULK 'https://azureopendatastorage.blob.core.windows.net/nyctlc/ yellow/puYear=2019/puMonth=*/*.parquet',. [Video description ends]
read parquet files. I can see at the end of that string on line 30 it says .parquet, so it's going to load all the parquet files from some blob storage, and then line 31 gives it the format, which is PARQUET. So, this is going to load some external blobs; let's run that.
[Video description begins] Line 31 reads as: FORMAT= 'PARQUET'. Line 32 reads as: ) AS [nyc];. [Video description ends]
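Putting those pieces together, the generated sample looks roughly like this; it's an approximate reconstruction from the lines shown on screen:

SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://azureopendatastorage.blob.core.windows.net/nyctlc/yellow/puYear=2019/puMonth=*/*.parquet',
    FORMAT = 'PARQUET'
) AS [nyc];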
I'm going to highlight from line 28 to line 32; if I do that, it will run just that SQL. In the menu just above my code window, on the left, I click Run, and I get the Results at the bottom. I'll just scroll the Results up a little bit. I see
[Video description begins] The "Results" tab displays a table with the following column headers: vendorID, passengerCount, and so on. [Video description ends]
a table with a whole bunch of data, so that's come from the parquet files. I can also click on Chart and I can see that charted. So, let's go back to
the home screen: I'm going to go to the left menu and click the button at the top, and now I'm back at the home page. On the far right, I'm going to click Learn again, and I'm back in the Knowledge center. Now, at the bottom, I'm going to click Browse gallery, the center box. These are all sorts of datasets.
[Video description begins] The content pane displays the following tabs: Datasets, Notebooks, SQL scripts, and Pipelines. Currently, the Datasets tab is selected. This tab shows various datasets, namely: Bing COVID-19 Data, Boston Safety Data, and so on. [Video description ends]
I'm going to click the top left one, which is Bing COVID-19 data. And then I'm going to click Continue at the bottom left of the screen. So now I can see all sorts of data on a Preview on the right-hand side
[Video description begins] The subsequent page opens. The content pane is divided into two sections: Description and Preview. The Preview section displays a table with the following column headers: id, updated, confirmed, deaths, and so on. [Video description ends]
there's a big table. At the bottom left I'll click Add dataset. And at the
top right of the screen, it tells me that the dataset was successfully created. On the left-hand side of my screen, in the Data section, under Azure Blob Storage and Sample data, I see bing-covid-19-data, so I'm going to click on the ellipsis on the right-hand side of bing-covid-19-data. I'm going to
[Video description begins] A drop-down menu appears. It contains the following options: New SQL script, New notebook, Edit, and so on. [Video description ends]
choose New SQL script and click Select TOP 100 rows. It generates that for me, and at the top left I'm going to click Run just above the code window, and I get data at the bottom. I'm just going to drag the Results window up a little, and I'm going to click on Chart under View. So now I'm seeing COVID data in a chart. Just to show something interesting,
let's change the query. So, I'm going to minimize the Results window, and I'm going to replace the query. Now, this is pretty much the same query as before: I've expanded the SELECT to be SELECT * on line 2, and on line 7 I've added a WHERE clause, where [result].country_region equals 'worldwide'. So, I'm going to click Run at the top left. Now we'll expand the
[Video description begins] Line 7 reads as: ) As [result] where[result].country_region= 'worldwide';. [Video description ends]
Results at the bottom, and I'm going to hide the series that I don't want by clicking on them in the legend. So now I have a graph that only shows the confirmed and recovered cases worldwide. In conclusion, the Knowledge center is very useful because it can get you going really quickly: you can play around with some data and get to learn Azure Synapse Analytics.
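For reference, the modified query looked roughly like this. This is a hedged reconstruction: the WHERE clause comes from the line shown on screen, while the OPENROWSET path is assumed to be the Azure Open Datasets location that the generated script pointed at, so check it against your own script:

SELECT *
FROM OPENROWSET(
    BULK 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet', -- assumed path
    FORMAT = 'PARQUET'
) AS [result]
WHERE [result].country_region = 'worldwide';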
Azure Synapse Analytics and PolyBase
In this video, we're going to look at how PolyBase is used to query data. PolyBase allows the following data sources to be queried directly using Transact-SQL, or T-SQL: Hadoop, Cosmos DB, MongoDB, Teradata, Oracle, and SQL Server. All of these sources can be treated the same from the point of view of the person writing the query, because PolyBase's T-SQL functionality normalizes the interface. Many SQL products support PolyBase: SQL Server 2016 (version 13.x) and later for Windows, SQL Server 2019 (15.x) for Linux, SQL Server Parallel Data Warehouse (PDW), and Azure Synapse Analytics. As briefly mentioned, PolyBase uses Transact-SQL to query data. Importantly, this feature allows joins of SQL Server data with external data sources. Moreover, these joins have the same syntax
regardless of the type of external source. This is a very powerful capability, because in the past, joining two disparate data sources like this would have involved either moving data until both sets were in the same data source and then performing the join, or querying the two data sources separately and joining the data via special code. The latest versions of SQL Server PolyBase allow many scenarios with many different sources, but two of the main scenarios that they all support are connecting with Hadoop, which is a popular, scalable, and relatively inexpensive big data storage solution, and connecting with Azure Blob Storage, which is an inexpensive and convenient storage for Azure services. Some other common scenarios are importing data from Hadoop, Azure Blob Storage, or Azure Data Lake into SQL Server or Synapse Analytics, and also exporting data to those data sources from SQL Server or Synapse Analytics. PolyBase does this by using what are called external tables, which are really references to tables in an external data source such as Hadoop or blob storage.
These external tables can be manipulated via T-SQL in much the same way as local tables are. Thus, importing data, for instance, simply involves copying data from the external table to the internal table. Transforms can also be performed on the data as it moves through the external table to the internal table. In this way, PolyBase implements a complete ETL, or extract, transform, load, pipeline. PolyBase implements some tricks for maximizing performance. If interacting with Hadoop, some of the computations can be pushed to Hadoop, easing the performance cost on the SQL Server end. This is managed by a component called the query optimizer, which dynamically decides which
computations are handled locally and which are handled by Hadoop. PolyBase scale-out groups manage parallel data transfer between nodes in Hadoop and SQL Server, and they also scale the compute for external data operations. Note that this can all occur without the knowledge of the person writing the T-SQL queries, because it's all abstracted from the actual query. Because PolyBase is integrated with SQL Server, it can be used with any SQL Server-compatible third-party BI tools, and it's compatible with Microsoft business intelligence and analysis components. This diagram shows how PolyBase accesses an external data source. PolyBase can scale out to several compute
nodes, each of which contains a Data Movement Service (DMS) using a Hadoop Distributed File System bridge. Each node can communicate with the data in the external data source, such as blobs in Azure Blob Storage. This parallel architecture means fast performance for data movement. So, here's the PolyBase process from start to finish. First, you extract your data into text files; perhaps your data is coming from an IoT hub, in which case the events would be extracted from the hub and translated into files. These text files would then be loaded into the external data source, such as Hadoop or blob storage. PolyBase T-SQL is then used to map the external data source, and the data would be imported into either SQL Server or Azure Synapse Analytics. At this point, you can perform transforms on the data via the T-SQL, although you don't have to. Then the data is inserted into the sink, or final production, tables. This process represents a complete extract, transform, load pipeline, from extracting source data to inserting transformed data into the final tables.
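To ground that process in syntax, here's a minimal sketch of the mapping and import steps in T-SQL. All of the names, the storage location, and the schema are assumptions for illustration, and depending on your source you may also need a database scoped credential:

CREATE EXTERNAL DATA SOURCE LakeSource
WITH (LOCATION = 'abfss://files@synapsedl2255.dfs.core.windows.net'); -- hypothetical location

CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);

-- The external table is only a reference to the files in the data source.
CREATE EXTERNAL TABLE dbo.ExtSales (
    SaleId INT,
    Amount DECIMAL(18, 2)
)
WITH (LOCATION = '/sales/', DATA_SOURCE = LakeSource, FILE_FORMAT = ParquetFormat);

-- The import step: copy from the external table into an internal table,
-- transforming in the SELECT if needed.
CREATE TABLE dbo.Sales
WITH (DISTRIBUTION = ROUND_ROBIN)
AS SELECT SaleId, Amount FROM dbo.ExtSales;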
Course Summary
So in this course, we've examined Azure Synapse Analytics. We did this by exploring integrating pipelines with Synapse Studio, visualizing data using a Power BI workspace, monitoring a Synapse workspace, exploring the Synapse Knowledge Center, and the features of Azure Synapse Analytics and PolyBase. In our next course, we'll move on to discuss Azure data storage monitoring.