unisys trainning 7th percipio traaining AZURE DATA FNDATAMENTALS :AZURE ANALYTICS WORKLOAD

 

Azure Data Fundamentals: Azure Analytics Workloads

Azure Synapse Analytics is a limitless analytics service that brings together data warehousing and big data analytics. In this course, you will learn about analytics workloads, including Azure Synapse Analytics, Azure Synapse SQL pool, Data Warehouse Units. You'll also learn about the difference between transactional and analytic workloads and batch and real time data processing. You'll use Azure Portal and Azure PowerShell to create a Synapse SQL pool and Azure Data Lake Analytics. You'll learn about data warehousing workloads and when to use a data warehouse solution. Finally, you'll learn about the different Azure Data Lake Analytics. This course is one in a series that prepares learners for the Microsoft Azure Data Fundamentals (DP-900) exam.

Course Overview

[Video description begins] Topic title: Course Overview. [Video description ends]

Hi, my name is Richard Spencer and I'll be your instructor for this course. Born and raised and currently living in Newfoundland, Canada, I'm a college instructor in the IT faculty with specialization in networking and cloud computing. 

[Video description begins] Your host for this session is Richard Spencer. He is a Technology and Developer Specialist. [Video description ends]

I currently serve as the Canadian expert for cloud computing for Skills Competence Canada, and I hold three AWS certifications and a Cisco CCNA. Microsoft's DP-900, Microsoft Azure Data Fundamentals certification is targeted to audience who have foundational knowledge of core data concepts and how they're implemented using Microsoft Azure data services.

Aimed at individuals beginning to work with data in the cloud, familiarity with the concepts of relational and nonrelational data and different types of data workloads such as transactional or analytical would be considered assets for this certification. Azure Synapse Analytics is a limitless analytics service that brings together data warehousing and big data analytics. In this course I'll cover analytics workloads, including Azures Synapse Analytics, Azures Synapse SQL Pool, Data warehouse units or DWU's.

I'll explore the differences between transactional and analytic workloads and batch and real time data processing. I'll use Azure Portal and Azure PowerShell to create a Synapse SQL pool and Azure Data Lake Analytics. I'll discuss data warehousing workloads and when to use a data warehouse solution and the different Azure Data Lake Analytics.

Azure Synapse Analytics Architecture

[Video description begins] Topic title: Azure Synapse Analytics Architecture. Your host for this session is Richard Spencer. [Video description ends]

In this video, I'm going to demonstrate the use of Azure Synapse Analytics. Synapse is a limitless analytics service. It works together to bring in data, prepare that data, manage and then serve that data for business intelligence and machine learning needs.

[Video description begins] The Azure services portal is displayed on the screen. It includes various options, such as: Create a resource, My resources, Azure Synapse Analytics, Resource groups, and various other elements. [Video description ends]

So once we're inside the portal, the quickest way to find it, we can just type in synapse up at the top search bar and you'll find Azure Synapse Analytics. I'm going to go ahead and click. It brings me into the Azure Synapse Analytics console. So the first thing we want to do is come down to create Synapse workspace. It'll bring you into a screen where we can go out and create our workspace.

[Video description begins] The Create Synapse workspace window is displayed on the screen. It includes tabs namely: Basics, Security, Networking, Tags, and Summary. Currently, the Basics tab details are displayed on the screen. [Video description ends]

So we have to provide our subscription. We also have to provide a resource group. So remember, resource groups are used to group resources together and have the same permissions and lifecycles. In this case, I'm going to create a new one. And I'm just going to call it synapseRKS and click OK.

[Video description begins] Under Resource group, he clicks the Create new option, types a Name and clicks the OK button. [Video description ends]

So I've got a new resource group created. We also need workspace details.

So what is the workspace name going to be? So I'll call that synapseworkspaceRKS. I'll change my region to East US. And then we have to specify a Data Lake storage Gen2 account name and file system. So I'm going to create a new account name. I'm going to call it rksdatastoragegen2 and a file system name, I'll call synapsefilesystemrks. 

[Video description begins] Under Account name, he clicks the Create new option, types a Name and clicks the OK button. [Video description ends]

So now we've got our account name and a file system name.

I'm going to assign myself the storage blob data contributor role on the Data Lake.

[Video description begins] He checks the box named Assign myself the Storage Blob Data Contributor role. [Video description ends]

Click next for security. It's going to ask for admin username. I had to click the admin username, which is SQLadminuser, and I give it my password. Make sure it matches on both fields. I come down, I don't want to enable double encryption.

[Video description begins] He unchecks the Enable double encryption option. [Video description ends]

I'm going to allow pipelines, so that way we can access the SQL pools. I click next for networking. I allow connections from all IP addresses and click next for tags. I can put in some tags so I can say created by Richard Spencer.

[Video description begins] Under the Tags option, he enters the Name and Value and clicks the Next Summary button. [Video description ends]

Click next for a summary, it gives you your serverless SQL estimated cost per terabyte to 6.40 CAD. I can review everything that I set up, I see my Data Lake account, file system, all my security and networking settings. So I go ahead and click Create, and now the deployment has initialized and it's in progress.

[Video description begins] The Microsoft Azure.Synapse Analytics page is displayed on the screen. The left pane contains tabs, such as: Overview, Inputs, Outputs, and Template. Currently, the middle pane displays the Overview page. [Video description ends]

So when it completes, you'll get a little green box up here that says completed and you'll be able to click go to resource from here. So it usually takes maybe five to ten minutes and then everything will be ready for you to use. So now you can see that deployment is complete, and I can click go to resource. So I'll go to resource, flip over,
and let's take a look at the workspace web URL.

[Video description begins] The synapseworkspacerks page is displayed on the screen. The left pane contains tabs, such as: Overview, Activity log, Access control, Tags, Diagnose and solve problems and various other tabs under Settings, Analytics pools, and Security. Currently, the middle pane displays the Overview page. [Video description ends] 

So I click here and I go into the workspace web URL and it'll load my Azure Analytics portal. Go ahead and accept the cookies and then Synapse will complete and now we're inside of our workspace. So in the left hand side, we have hubs of different capabilities. I can click expand and we can take a look at all of them.

[Video description begins] The Synapse Analytics workspace is displayed on the screen. The left pane contains hubs, such as: Home, Data, Desktop, Integrate, Monitor, and Manage. [Video description ends]

At the data hub, you'll see workspace and linked data sources. So if I had any databases that were tied in to this Synapse workspace, you would see them here. And under link is where we keep all those that live outside the Synapse workspace. One example of that would be the Data Lake storage that we created earlier.

So you'll find that under linked. Under develop, here's where we can go ahead, create SQL scripts. We can create Spark notebooks. We can do mapping for data flows. And we can power business intelligence reports that can be linked here too. So I can click new SQL script. And I'll be brought in to create a SQL script. And I can also create a new notebook here as well. So this is our development tab. This is where we create our SQL scripts.

We map our data flows. Under Integrate, you'll see this is where we orchestrate, build ingestion and orchestration pipelines that'll copy data from other sources. So I can create a new pipeline or I can copy a data tool. If I click pipeline, you'll see me brought in to the screen where I can actually map out the data ingestion and build that orchestration. And finally under monitor and manage, we can use that to monitor jobs and manage our security. We can manage our SQL pools here in the Apache Spark pools. It's all done under manage.

So I hope this video showed you how to provision a Synapse Analytics workspace. I hope this showed you how it's possible to build an end to end analytics workspace just by provisioning a Synapse workspace in just a few minutes. So get in, dig around, find out what works for you, get your data in, and then start serving it up for business intelligence and machine learning.

Using the Azure Synapse SQL Pool

[Video description begins] Topic title: Using the Azure Synapse SQL Pool. Your host for this session is Richard Spencer. [Video description ends]

In this video, I'm going to discuss some of the best practices for dedicated SQL pools in Azure's Synapse Analytics. So once we're in our dedicated SQL pools console, we can find the SQL pool that we want to maybe fine tune or add some features to.

[Video description begins] The Dedicated SQL pools console is displayed on the screen. On the top-left it includes options, such as: Add, Edit columns, Refresh, Assign tabs, and Delete. [Video description ends]

And we can go ahead and click it. It brings us into this SQL pool console, right?

[Video description begins] The RDS_Richard rkssqlpool console is displayed on the screen. The middle pane contains options such as: Overview, Activity log, Tags, Diagnose and solve problems, and various tabs under Settings, and Security. [Video description ends]

And the first thing it's going to say is your dedicated SQL pools can now be accessed from a Synapse workspace. So create a workspace here. So if you haven't created a workspace yet, but you created your SQL pools, you can, as a best practice, create a workspace and access your SQL pools from your synapse workspace.

Other things we need to look at. We need to reduce costs with pause and scale. So we can pause our SQL pool, it'll come up and say there are no active user queries, would you like to continue on pause, right? 

[Video description begins] He clicks the Pause button at the middle of the screen and clicks the Yes option. [Video description ends]

So we want to do that, we just want to make sure there's no active user queries. So we can pause when it's not in use. You'll see now that it is pausing, and it says this SQL pool is pausing. You may experience limited to no connectivity. So once it's done pausing, you'll see a little pop up come up here and then your SQL pool is paused.

Now it's only good to do that if there's no active user queries, so something you have to be wary about. Once it's paused, you'll see it says this SQL pool is currently paused. You may experience limited to no connectivity and you get a pause database success. I'm going to resume the SQL pool, so I can show you some other best practices and features.

[Video description begins] He clicks the Resume button at the middle of the screen and clicks the Yes option. [Video description ends]

So now it shows me the SQL pool is resuming. Again, we're going to experience limited to no connectivity until the SQL pool is back up and running again. Once it's up and running, you'll get another alert and now you can carry on. So another best practice I want to look at is scaling. So if we click this Scale button, here's where we can scale our system.

We've got a slider. And we can increase the slider to scale our system up, or slide it left to scale it down. You'll notice that DW100c is what I'm currently at, I can slide this all the way up to DW30000c. That's a bit much. We can bring it back down to say DW1500c. And the point of this is that you want to optimize your scaling. You don't want it too big, but you don't want it too small. So in the beginning, you want to increase and decrease and find that sweet spot for your application. We can also view queries here, if there are any queries running.

[Video description begins] He clicks the View Queries option at the middle of the screen. [Video description ends]

We can look at the query activity, which is also good best practice. We may export it to Excel. We can take a deeper look into it. Right now I have no running queries, so there's nothing to see, I click OK.

[Video description begins] The RDS_Richard rkssqlpool console is displayed on the screen. [Video description ends]

We also have Maintenance schedule, so it's a best practice to create a maintenance window. Some other best practices we can look at would be monitoring. So you want to create some alerts and some metrics on your monitoring, whether it's raw long running queries, something along those levels. And it's also good for monitoring to optimize your queries. So if we come in, we look at query activity. If we had any long running queries, we can investigate them here.

[Video description begins] On the left pane, under Monitoring, he clicks the Query Activity. [Video description ends]

And then we can optimize those queries further down the road. So I hope this video has showed you some of the best practices with SQL pools.

Including pausing and restarting your pool, scaling your pool, and monitoring your pool to optimize any long running queries.

Data Warehouse Units

[Video description begins] Topic title: Data Warehouse Units. Your host for this session is Richard Spencer. [Video description ends]

Now let's start our journey into data warehousing by discussing Data Warehouse Unit. Data warehouse units is just a unit of measure of resources that includes CPU, memory, and input/outputs per second that's assigned to your SQL Data Warehouse database. So increasing the number of DWUs will increase resources and performance of your queries by way of parallelly executing them. You can scale by increasing or you can decrease DWUs simply by moving a slider in the Azure portal, or in the properties page of the database in SQL Server Management Studio, or by using PowerShell scripts. Data warehouse units also provide an abstraction and hides details on underlying hardware and software so they can change or move without affecting your workload. With this abstraction in place, Microsoft can adjust the underlying architecture of the service. When you need faster performance, you can simply scale your database up by increasing your DWUs.

And then you can scale your database down by decreasing the DWUs. This way it enables you to minimize the cost because you're only paying for the resources that you need and that you're currently using. Now let's discuss metrics. So how can we determine our metrics for our data warehouse unit? So these metrics are based on data warehouse workload metrics, all right? So when we're dealing with an SQL pool, formerly known as an SQL data warehouse, how fast can that standard dedicated SQL pool query and scan a large number of rows? And then how can it perform those complex aggregations on it that we need? These are both input/output and CPU intensive. Another metric is data ingestion. So how fast can that dedicated SQL pool ingest data from Azure storage blobs or an Azure data lake, right? This is also network and CPU intensive. And table copying, so how fast can we copy a table?

How fast can we read that data from storage, pass it out across the nodes of the appliance, and then write it back to storage again? This is CPU, I/O, and network intensive, right? So these are the metrics that we need to determine how well our data warehouse units are applied. Service level objectives. So an SLO, or service level objective, is an agreement within an SLA about specific metrics. So what metrics do we monitor and determine the performance of our data warehouse units?

We're more concerned about cost and performance. So we want to balance off the difference between cost and performance. So the service level objective is a scalability setting that you can use to determine your cost and performance. You increase the scalability setting, you get increased performance and increased cost. If you decrease that service level objective, you get a decrease in costs and also a decrease in performance. Now let's discuss performance tiers. So each performance tier uses a slightly different unit of measurement for their data warehouse units.

You'll notice this difference because when you look on your bill, the unit of scale is different. So our performance tiers, we have generation one. In generation one, data warehouses are measured in data warehouse units. Generation two uses compute data warehouse units. Otherwise you'll see it abbreviated as cDWUs. Now, both generations allow for scaling up. You can scale down. You can pause when you don't need to compute, right? These operations are all on-demand, nothing changes between the two generations. The only thing is generation two offers a performance increase because it uses a local disk-based cache on the actual compute node itself. So it has access to that data so much quicker.

So when you scale or pause that system, the cache needs to be completely invalidated, right? So then when we bring it back up, there's going to be a period of cache warming before we can reach that optimal performance. So how do we optimize our DWUs for our workloads? What's the ideal number of DWUs that I need? Well, it all depends on your workload and the amount of data you've loaded into your system, right? Small amounts of data, small workloads, you can have a smaller DWU.

If you have a big amount of data, and you're doing a lot of historical data on many years, then you're probably going to want to increase your DWUs. But to optimize, you want to start small, so implement a smaller DWU. Monitor that performance and test the data loads, right? If you see the performances is sluggish and the data loads are high, then you're going to have to make some changes, right? Identify some additional requirements, maybe we need x number of more DWUs. So then we can apply that, and then we can go back and monitor the performance and test the data loads. Identify any additional requirements again, and repeat the process until you've hit the optimal DWUs for your workload. So much that your performance is good, but your cost isn't out of the world, right? Your cost is still down.

You don't want to provision a ton of resources and not use it. The deal here is to optimize your DWUs to get just the perfect amount that you want to run a really fast optimized system without spending more money than you need to. So permission. Changing data warehouse units requires permissions. It's the same as the alter database command. So Azure has built-in roles, such as SQL DB Contributor, and the SQL Server Contributor. And these are your built-in roles that can change DWU settings.

Changing DWUs. So once our permissions are applied, what options do we have for changing the DWUs? We can do it in the Azure portal. So we just go in our database, we click Scale. Under Scale, we'll see a little slider and we'll move it left to decrease, right to increase, right? Left to scale down, right to scale up. We also can do it in PowerShell. So we have a new Azure PowerShell now. We can run the Azure set AZ SQL database PowerShell cmdlet. And that will allow us to update our DWUs right from our PowerShell console. We can use REST APIs and send APIs to Azure to update our DWUs. And we can also use T-SQL, or transactional SQL, and that will allow us to also update our DWUs in our data warehouse.

So scaling workflow. When we actually scale up or scale down, what happens? So we have two options. We can scale up our operations, so that means we're going to increase more resources. So in that case, we have to make sure there's no transactions in process. We need to scale anything back that's in process. Detach all the nodes, provision the new resources, and then reattach. When we scale down, it's the exact opposite of that, right? So we need to make sure all the transactions are complete, we need data consistency. Once we know the transactions are in the scaled back state, we can release our additional nodes, provision down any nodes we don't need, and then reattach with the remaining nodes. So we've scaled down and we've kept our transactions consistent.

Transactional and Analytic Workloads

[Video description begins] Topic title: Transactional and Analytic Workloads. Your host for this session is Richard Spencer. [Video description ends]

Now, let's get into database types, and what the difference are between transactional and analytical workloads. So in this case, we're going to talk about OLTP, On-Line Transaction Processing, and OLAP, On-Line Analytical Processing. Most businesses today store data in an OLTP database. And then that's accessed by different users who perform queries, simple queries. When you go to, say, a supermarket, your point of sale system at the cashier uses an OLTP. Your ATM will also use an OLTP, so an on-line transaction processing system. These OLTP systems are made to store day-to-day business transactions and are great for querying specific, and as I said earlier, simple record. If you want to do any complex querying or compiling year over year historical data, that's where the on-line analytical processing comes into play.

That'll give you a multi-dimensional view of your enterprise data, rather than a transactional level view. But when we put both of these systems together, they form the two sides of a data warehousing coin. OLTP systems are the original data sources across the enterprise. And the OLAP systems integrate that data from these sources and present it in a multi-dimensional view for reporting and analysis. So let's look at some of the difference between OLTP systems and OLAP systems. First of all, the OLTP systems are used to manage the fundamental business operations. So selling, invoicing, these type of things, these are your fundamental business operations. It uses simple query, so no ad hoc queries or nothing too intense. It gives you the day-to-day transactional view. So sales per day, revenue per day, returns per day, right? This all goes back to your fundamental business operations. And it allows for fast processing, so ATM machines, point of sale machines, they need to be processed fast. It's not something we can wait for. These are used by employees, right? So the person checking you out at the supermarket or the teller at the bank, right? That's what we use the OLTP systems for.

Maybe it's people in the business who want to see how many sales he did that day really quick, right, a sales manager of some sort. On the other hand, OLAP systems provide you with that consolidated view of organizational data. And can take it from many different sources, aggregate it, and give it to you. It uses complex queries, it needs to be complex to pull all those different data sources and put them all together. It gives you a multi-dimensional view of your data. It takes in account for complex processing. This requires complex processing. Again, because you have all these different sources, we're pooling in all that different data, using our complex queries. We're aggregating that data and presenting it in a multi-dimensional view. And it's used by analysts. So business analysts, in this case, could use an OLAP system to really get in and look at sales records of the previous years and by region.

And get into some complex queries and really make some good decisions for their business based on this data that's presented to them using the combination of OLTP systems and OLAP systems. Some more database differences. OLTP systems commonly use your regular SQL INSERT, DELETE, and UPDATE commands. They use ACID-Compliance for data integrity. Data is backed up on the regular basis. And it has smaller storage requirements. Whereas, OLAP, on the other hand, commonly uses the SELECT. We're not inserting data, we're not deleting data, we're just viewing data. It has fewer data integrity concerns, because a lot of that is taken care of before the data gets into the data warehouse. Backups are required less frequently. You have larger storage requirements because you're dealing with a larger amount of data.

Response times, OLTP systems require short response times. Again, if you're getting checked out at the grocery store, that point of sale system, you don't want to be there all day, you want a short response time, right? And that's the case for the nature of all OLTP environments, right? It's predominantly any kind of interactive, ad hoc usage. Another example could be telemarketers entering telephone survey results, right? OLTP systems like that require a short response time in order for users to remain productive, right? So if I'm putting in data into the system from a telephone survey and it's taking me two minutes every time I click save, then I'm not going to be very productive. So we want to keep the users productive, whether it's a point of sale system or a back end system where you're putting in telephone survey results.

Regardless of what the usage is, at the end of the day, it's an end user that are using these systems. And we need these response times to be very short to keep these users productive. Small transactions are another feature of OLTP system, right? They typically read and manipulate smaller amounts of data. And the data processing is generally simple. We very rarely have complex joints, right? As we mentioned many times before, we're doing simple queries here, simple inserts, simple deletes. We're not doing a lot of complex querying and aggregations. Data maintenance. Background programs may require many data-intensive computations. So it's not uncommon to have reporting programs and data updating programs that must run in the background periodically or on an as needed basis. These programs, which run in the background while you're working away on other tasks, may require a large number of data-intensive computations. One of these examples may be a university may start batch jobs, assigning students to classes.

While students can still sign up for online classes themselves, right? So we're back to batch jobs here. We may run up a lot of batch jobs to update data in the background while users are doing their regular day-to-day tasks. OLTP Environments. They must be highly available, right? So these availability requirements are often really hot. An unavailable OLTP system can impact a large user population. Organizations can suffer major losses if OLTP systems aren't available when they should be. So a stock exchange system, for example, could have extremely high availability requirements during trading hours. We also deal a lot with high concurrency. So you're going to have a lot of users and you're going to need short response time. And combine that with the small transactions, concurrency in OLTP environments is very hot. So requirement for thousands of concurrent users is not something out of the realm of possibility. So depending on your application type, your user population, and your data retention time, that will make a big difference in the volume of your data.

If you have a small application with three or four users, and you only keep data for a month, you're not going to be dealing with a lot of data. If you have a major application that use some form of ERP application, with 1,000 users, and you're keeping data for seven years, you're going to be dealing with large amounts of data volume. Lifecycle-related data usage. You may encounter multiple data access patterns over time. Very similar to data warehouse environments, these OLTP systems often experience different access patterns over time. So end of month data access patterns could be different than the end of the year data access patterns, and so on.

So depending on what the users are looking for in that time of the year, or that time of the lifecycle, shall we say, you may encounter different data access patterns.

Batch and Real Time Data Processing

[Video description begins] Topic title: Batch and Real Time Data Processing. Your host for this session is Richard Spencer. [Video description ends]

Batch processing is just an efficient way of processing massive amounts of data, where group of transactions collect the data over a period of time. So the data is collected, entered, and processed, and then the batch results are produced. So batch processing requires separate programs for input, processing, and output. So we're batching it we're doing all in a batch. We're collecting all the records and we're processing them and then we're outputting them. One example could be payroll systems or billing systems, right? But on the flip side, real-time data processing is continual input, continual processing, right? Data must be processed in a small time period. Things like ATM machines or radar systems.

These can't be batched, right? We can't batch process an ATM machine and wait a length of time, maybe up to 12 hours, before banking records get updated. It's not acceptable. We need in this case real-time data processing. So that's the difference between batch processing and real-time data processing. Real-time data processing is great for gathering information to make quick decisions. So you don't want to wait 12 hours, 24 hours for a batch of records to be processed and reports to be generated. All right, we want to do it quick, so we need real-time data process. Most organizations use some form of batch data for different aspects of their business. But when you're dealing with real-time data and you need to make decisions based on that real-time data, then you need to process that data in real-time. It can't wait.

Real-time data processing and analytics allows an organization like yours, the ability to take action immediately, right, when time is of the essence. The goal is to obtain the insight required to act prudently at the right time. Which nowadays usually means right away or immediately. Complex event processing. We use this to detect patterns and identify threat. So it combines data from multiple sources to detect patterns and attempt to identify, say opportunities or threats. The goal is to identify significant events and respond fast. So things like sales leads or orders from a customer, for example. Operational intelligence, OI. A big buzz term these days, operational intelligence, that uses real-time data processing and complex event processing to gain insight into operations.

By running these query analysis against live data, right? So operational intelligence is near real-time analytics over your operational data. And you can view that data from many different sources. Intelligence, we discussed operational intelligence in the last slide. But operational intelligence again is used to detect and remedy problems immediately before the customer even knows of a problem. And in order to do that, we need to access that real-time data. And we need to apply this operational intelligence to that real-time data. That brings us to operational business intelligence. Operational business intelligence is more of historic analysis. Okay, so is descriptive or historical analysis of our operational data. Operational intelligence is more real-time analysis of operational data. Whereas operational business intelligence is again we're looking at historic data.

So operational intelligence more real-time. Operational business intelligence is more dealing with historic data.

Using Azure Portal to Create a Synapse SQL Pool

[Video description begins] Topic title: Using Azure Portal to Create a Synapse SQL Pool . Your host for this session is Richard Spencer. [Video description ends]

[Video description begins] The Microsoft Azure services portal is displayed on the screen. It includes options, such as: Create a resource, Azure Synapse Analytics, My resources and various elements. [Video description ends]

In this video, I'm going to show you how to create an SQL pool using the Azure Synapse Portal. So once we're inside Microsoft Azure, I can come up to Azure Synapse Analytics. I can click it and it brings you in and shows you any workspaces you have created. I'm going to go ahead and click my synapseworkspacerks. And now I'm brought into the synapse workspace console. So I usually collapse the left hand side, so I can see more of it. And what we want to do is create a dedicated SQL pool, right?

[Video description begins] The synapseworkspacerks console is displayed on the screen. The left pane contains options such as: Overview, Activity log, Tags, Diagnose and solve problems, and various tabs under Settings, and Security. [Video description ends]

So all our data warehouses are created using these dedicated SQL pools formally known as SQL DW in Azure Synapse Analytics. So what we're doing here is we're just defining a set of compute resources.

So I'm going to come up and click New dedicated SQL pool. It brings me to a screen where I have to put in a dedicated SQL pool name. I'll call it rkssqlpool. And then you see the ever-so-important performance level. So we can move the slider left or right. Left gives us less compute resources and a lower price. And if you pull it all the way up to say DW30000c, it's max resources, but you're paying about $579 per hour. So for this demonstration, let's bring it all the way down to the left to DW100c. I Click Next. It's asking me for a data source. Do I want to start with a blank dedicated SQL pool?

Do I want to restore from a backup? Or populate from a restore point? I'm going to use a blank dedicated SQL pool. I'm going to leave the Collation as default. I add a tag, CreatedBy RichardSpencer.

[Video description begins] He clicks the Next:Tags button at the bottom of the page. [Video description ends]

[Video description begins] He enters the Name and Value and clicks the Next: Review + create button at the bottom of the page. [Video description ends]

I click Next, I review everything that I just set up. My performance level, my cost, and I click Create. The deployment has initialized and submitted. And it doesn't take very long, maybe about five minutes, and then you'll see your resources have been created. So once the provisioning completes, we can click Go to resource.

[Video description begins] The Microsoft.Azure.Synapse console is displayed on the screen. The left pane includes options such as: Overview, Inputs, Outputs, and Template. The middle pane displays the Overview section. [Video description ends]

And under Analytics pools, you'll see SQL pools. You can click there, and then you will see the dedicated SQL pool that we just created.

So you can click it, you can go in there. And now that dedicated SQL pool's available in your workspace for loading data, processing streams, and reading data from the Lake.

Using Azure PowerShell to Create a Synapse SQL Pool

[Video description begins] Topic title: Using Azure PowerShell to Create a Synapse SQL Pool . Your host for this session is Richard Spencer. [Video description ends]

[Video description begins] The Windows PowerShell ISE console is displayed on the screen. [Video description ends]

In this video, I'm going to demonstrate how to create a dedicated SQL pool with Azure PowerShell. So the first thing we need to do is install the Azure PowerShell. So you can find this script online. I've also pasted it in here along with all the other scripts we're going to use. So you'll see the first is just simple if statement, and this is used to install the Azure module for PowerShell.

[Video description begins] Line 2 reads as: if($PVersionTable.PSEdition -eq 'Desktop' -and(GET-Module -Name AzureRM -ListAvailable)) {. Line 3 reads as: Write-Warning -Message ('Az module not installed. Having both the AzureRM and' +. Line 4 reads as: 'Az modules instlled at the same time is not supported. '). Line 5 reads as: } else {. Line 6 reads as: Intall-Module -Name Az -AllowClobber -Scope CurrentUser. Line 7 reads as: }. [Video description ends]

So I'm going to highlight it and run the selection. You may get a pop up that you have to confirm a few things. I've already installed it so mine just finishes and it's nothing else to do. The next important step is to connect our Azure account. So I can highlight it in the ISC. And I can push F8 to run just that selection, or I can right-click and say Run Selection.

So I run the selection. I select my account. I put in my password and now this maps your Powershell environment for your Azure portal. Okay, so you'll see that it contains more than one active subscription. The first one will be used. So that's fine. We've installed PowerShell. We've connected our account. Next we want to get our subscription even though I already know mine, sometimes it's good if you're using multiple subscriptions to pull back your subscription that is actually in use.

[Video description begins] Line 11 reads as: GET -Az Subscription. [Video description ends]

I've got two, as the previous message said, they're going to take the top one. So the Azure subscription 1. So once we're at this stage, we need to set some variables. So we need to set these variables for a resource group name, I don't want to have to specify these over and over.

I just want to set them once in the variables and reference those variables as we go up. So I'm going to create a resource group name as my resource group. The location is going to be in West US Server name equals server dollar sign get random. So we're using a random value. We can replace with our own value if we want. And then we set it admin login and admin password, and then the IP address range that we want to allow access to the server. So quad zeros for start and end means it's fully accessible. Next, we set the variable for our database name, and call it my sample data warehouse. Now highlight this script, right-click and run selection, happens very quick. And we assign those variable names and now that's done. So next thing we need to do is create a new resource group. So we're going to use the cmdlet new-AZresourcegroup, we are going to give it the name of the resource group name that we specified up here, which works out to be my resource group.

And the location references the location variable, which is in West US. So I'll highlight, right-click and run the selection. It says the provided resource group already exists. Are you sure you want to update it? Yes, I do. If it wasn't already created, you wouldn't have got that confirmation. Or if you subsequently do this multiple times, you will get this confirmation box. So I'm going to say yes. Wait for it to finish. The output is here, my resource group in West US and it has succeeded. So our next step is to create a new server.

You'll see the cmdlet new-azsqlserver. We specify the resource group name. And then we specify the variable that we assigned earlier. Server name will be also the variable we assigned earlier along with locations and SQL administrator credentials. So it's a new object of type PsCredential. It takes in the admin login and the password and they convert it to a secure string. So I'll highlight this script and I'll run this selection. So right now it's talking to Azure and it's creating that new server for me. When it's done, you'll always get the output come down here. And you can see that it's running the script or to selection, because it says it down here. So we'll wait for this to finish. And then we can move on to configure some firewall rules and finish with creating that dedicated SQL pool.

So once it's finished, you can see the resource group has been created with login, the password, the version, fully qualified domain name and so on. So our server's now created, let's create the firewall rules. So we're going to say new-azsqlserverfirewallrules. We specify our resource group name and our server name. And then we put in the firewall rule to allow all of our addresses from start IP to end IP. So I'll highlight this selection and I will run it. This is very quick. It's in the output is out. Right, so we're allowing everybody 0000 to 0000. And finally, here we go about creating our dedicated SQL pool. You'll see we have our resource group name. So that's the resource group we're using, all right. It's the same variable we set earlier.

Our server name is the name of the server we set earlier. And database name is the name of the dedicated SQL pool that we're creating, okay? So this is a reference variable, up to here, where the database name is my sample data warehouse. You'll see, the addition this data warehouse, it must be set the data warehouse to create a dedicated SQL pool. And the requested service objective name is the amount of data warehouse units we're requesting. All right, so this is a required parameter. Coalition name is the default coalition. And the maximum size in bytes. This is used to limit row store data. Once I'm happy with everything, I can highlight it, run the selection, and wait. This may take up to ten minutes or so to provision all these resources. When it's done, you'll get an output. And we'll take a look in the console just to confirm everything is how we expect it to be.

So once it's finished, you'll see your prompt returned back and you'll have an output. So it shows us the resource group name, the server name, our database name and the location. It also gives us our current service objective. Our resource ID, the capacity, the SQ name and all those other pieces of information that we may need to use down the road. So to be sure everything is the way we want it to be, we'll come back. We'll come back to SQL pools.

[Video description begins] In the Microsoft Azure Search field, he types sql pools and selects Dedicated SQLpools from the search list. [Video description ends]

We click it and then we see the pool we just created. It's online, the server, the pricing terror and so on. So we can click this dedicated SQL pool will be brought into the console for that SQL pool. And you'll notice it says, your dedicated SQL pools can now be accessed from a synapse workspace.

[Video description begins] The mySampleDataWarehouse window is displayed on the screen. The left pane includes tabs, such as: Overview, Activity Log, Tags, Diagnose and solve problems and various other tabs below Settings and Security. [Video description ends]

So we could click here to create that synapse workspace all at the same time. So I hope this video shows you how to not only install Azure PowerShell but to create a dedicated SQL pool from the Azure PowerShell interface.

Warehousing Workloads

[Video description begins] Topic title: Warehousing Workloads. Your host for this session is Richard Spencer. [Video description ends]

Now let's discuss data warehousing workloads. What are some of the workloads that we can incorporate into our data warehouse? First, continuous data loading. So continuous or near real-time data loading for dynamic data elements. This workload creates an environment much the same as an OLTP database workload, in that all of the indexes and other Optimization structures, have to be continuously updated. Because we're continuously loading data, new data over time. So all these indexes and optimization, also needs to be updated with it. This continuous updating though has a major impact on aggregate and summary data, which is used to populate reports and dashboards. Another data warehousing workload could be batch data loading. So consider what data needs to be refreshed continuously. Not all data is constantly needed in real-time.

So what can wait and what can be batched, what needs to be processed right away? Payroll? You can batch process that. Payroll's not something that happens continuously, you're not paying people every hour, you're paying people every week or every two weeks. So we can batch process that. Customer orders, you may want to do that in real-time. Customer orders is not something that you want to batch and do every day, right? Customers expect that their order will be processed and received in a timely fashion. So you're going to want real time for something like that. Another data warehouse workload will be reporting, what is all this data if we don't use it to perform some reporting? So whether it's used for strategic advantage or just pulling daily financial numbers, reports are the reason why we keep and use so much data. Now, that being said, these reporting systems will need indexing and tuning because we are querying data.

We need indexes, we need to fine tune it. It also needs storage partition. These reporting systems are going to pull back data, we need a place to store it. And optimization structures. We don't want long running queries. We want reports fast. People don't want to wait for these type of things. So if you're going to incorporate reporting, these are things that you're going to need to investigate and find out what will work best for you. So you can really fine tune that reporting system. Business analytics is another workload for data warehousing. So, traditionally, business analysts would use these analytical apps and they would drill down and drill through multidimensional objects and cubes. These users had little to no programming experience.

So bring in today's model of tactical business analysis, that assumes sufficient knowledge of specific business analysis techniques to get the job done. So in this case, business users will rely on BI architects to build commonly used cubes for their work. Ad hoc queries. So, ad hoc queries are created when questions arise that are not able to be solved with a predetermined or predefined data sets, right? True ad hoc queries or data miners need to be supported by the data warehouse. Because these users often use the data in random and unpredictable ways. Again, they're not running predefined queries or querying data on a predefined data set, right?

These are data miners, so they're really going in and they're pulling out very random queries on that data. And it's difficult to optimize or tune these queries. Because again, they're so unpredictable. How do you optimize or tune something that you don't know what's going to be asked of you? Right so the ad hoc queries can be supported in Azure data warehousing solutions, and it's also a very good workload to incorporate into your data warehouse. And finally, OLTP databases. So, analytic and business intelligence are two words we hear a lot. We can divide our IT systems into a traditional OLTP or online transactional processing system, and an analytical or OLAP, online analytical processing.

In general, we can assume that these OLTP systems provide source data to data warehouses, whereas the OLAP systems help to analyze it. So a data warehouse exists as a layer on top of an OLTP database. So if you want to find a way to bring in your transaction processing information and be able to analyze it and perform business intelligence on it that you've never done before. Then put a data warehouse on top of your OLTP database. And you can feed that data into the warehouse and take advantage of all the analytics and business intelligence to gain strategic advantage or see where you're losing money or gaining money. Right, but you want to feed that OLTP data into a data warehouse. And that's another really good example of a data warehouse workload.

Data Warehouse Solutions

[Video description begins] Topic title: Data Warehouse Solutions. Your host for this session is Richard Spencer. [Video description ends]

So when do we know that we need a data warehouse solution? Concepts of data warehouse aren't difficult to understand. A data warehouse is a special purpose database system for collecting data from various sources. And it enables users to make complex queries about that data. Usually it's used in reporting, analysis, and business intelligence. Where data from sales and marketing and other departments of a company are combined to enable strategic decisions. It's a large database used for research and analysis, rather than real-time mission critical applications. So if an online retailer were making use of a data warehouse, they would use that data to figure out things like how good reviews corresponded with sales.

Or what kind of items are trending, how effective other shoppers looked at links are, these types of things. So unlike an operations database, that data does not have to be particularly timely. It doesn't have to be real-time processing, Structured repositories. So what's the difference between a data warehouse and a data mart? A data warehouse holds multiple subject areas in very detailed information and it works to integrate all the data sources together. On the other hand, a data mart is what we call the access layer of the data warehouse environment. We use this layer to get the data out to the users. So the data mart is a subset of the data warehouse.

And is usually oriented to a specific business line or team. Say, finance has a data mart, marketing has a data mart, and so on and so forth. But the key takeaway here is that that data warehouse is basically a relational database schema which stores historical data and metadata, from multiple systems, right? It's a structured repository, and it's aimed to facilitate reporting and analysis. Some benefits of the data warehouse. First of all, integration. You've got multiple different data sources and you need reporting on, they can all be integrated to your data warehouse, and then your data warehouse can combine all that data.

So it integrates with all your different data sources. It holds so much historical data that we can run a bunch of business analysis and business intelligence on data years back. And this can help our organization make crucial decisions for going forward. Data restructuring can be done here. So different data sources means different data formats. We can put them all in one consistent format that can be read by the data warehouse. Reporting, I mean, why would we be doing any of this if it wasn't for the reporting and the analysis and the intelligence that it brings? And the business intelligence solutions. So we can have business analysts in multi-dimensional cubes looking at data from different sources that has been aggregated.

So that's the beauty of the warehouse, we take this source data from multiple different appliances or applications, bring them in, integrate them, view all of our historical data, restructure any data that needs to be. Then we can pool our reporting and do all our business intelligence solutions. So it is a massive asset to any organization who wants to perform a lot of these complex analysis to change the way they use their business, change the products that they offer, whatever decisions that they need to make going forward. End user reports are easily accomplished in business intelligence environments. You can have a developer of some sort create multi-dimensional cubes for the business analysts, right?

End user reports could be automatically sent an email to executives every morning, right? And then that brings us to dashboard. So we can take the data in these reports and flash it up in a dynamic presentation of data, right? So we see these movies where everyone's in network operation centers and they have these big dashboards with all these numbers and graphs. That's real life, we can actually do that. We can do it for the executives, for sales managers, whoever, and they can look at their dashboard. They can see the number of sales, the revenue, and profit, and everything, just at one glance, right? And they can see it all in one spot. And data mining. So data mining tools can be used to help identify hidden patterns, right? Data mining uses a lot of ad hoc complex queries.

And it can pool out data and patterns that we would've never seen before. So if you want to do any data mining, a data warehouse would be a good implementation to start with. Some more data warehouse benefits, secure access. So a data warehouse makes it much easier to provide secure access to data for people who have a legitimate need to see the data. And support for analytical tools, right? So there's many different vendors who have different analytical tools. These tools will work best when extracting data from a data warehouse.

Using Azure Data Lake Analytics

[Video description begins] Topic title: Using Azure Data Lake Analytics. Your host for this session is Richard Spencer. [Video description ends]

[Video description begins] The Microsoft Azure services portal is displayed on the screen. It includes options, such as: Create a resource, Data Lake Analytics, All resources, and various other elements. [Video description ends]

In this video, I'm going to show you how to set up and get started with Azure Data Lake Analytics. So once you're in the Azure console, you can come up here and type in Data Lake Analytics, and you can find your way to the Data Lake Analytics screen.

[Video description begins] In the Microsoft Azure services console, in the Search field he types data lake and selects Data Lake Analytics from the search list. [Video description ends]

So first thing we need to do is create a data lake. So I click the Create Data Lake Analytics button. We have to do the same thing as we do when we create any resource in Azure.

[Video description begins] The New Data Lake Analytics account page is displayed on the screen. It includes two tabs, Basics and Review + Create. [Video description ends]

We provide our subscription and a resource group. So a resource group is used to group together resources with the same lifecycle and same permissions. So I'm going to create a new one and call it datalakerks. Now it's asking me some Data Lake Analytic details. What do I want to give the service name?

I'll call it datalakeservicerks, and you can see the URL come up down below. I'm going to leave my location in East US 2, and I'm going to use my Azure subscription for my existing storage subscription. My pricing package is going to be Pay-as-You-Go, and I also have to create a new Azure Data Lake Storage Gen1. So I enter the service name to create that new Gen1 account. I call it gen1rks. I add it, and I click Next. It brings me into my review. Everything looks good. I've got my storage account name, storage account type, and resource groups. So I go ahead and click Create. You'll see the initializing deployment. And then it'll bring us to a new screen where we can watch some more of the deployment details.

It submitted the deployment, deployment is now in progress. And then we just wait for everything to be provisioned.

[Video description begins] The Microsoft.AzureDataLakeAnalytics page is displayed on the screen. On the left pane, it includes options, such as: Overview, Inputs, Outputs, and Template. The middle pane displays the Overview section. [Video description ends]

So when it's complete, we can click Go to resource, and it'll take us into our service.

[Video description begins] The datalakeservicerks page is displayed on the screen. The left pane contains tabs, such as: Overview, Activity log, Access control, Tags, Diagnose and solve problems, Events and various other tabs under Settings and Getting Started sections. [Video description ends]

So once we're in here, the point is to kind of get this going quick and easy for developers. Okay, so I'm going to come up and click Sample Scripts.

[Video description begins] The Sample scripts page is displayed on the screen. It includes two options, namely: Copy sample data, and Install U-SQL extensions. [Video description ends]

You'll see, in order to use sample scripts, we need sample data. So it says, Click here to copy 50 MB of sample data into your Data Lake Storage Gen1 account. So I'm going to go ahead and click that. You'll see another one, U-SQL Advanced Analytics. That's 2.5 GB of extensions. It does cognitive learning and these kinds of things.

So I'm not going to install that one today right away. I'm just going to wait for my sample data copy to be done. It's done and complete. So once we come back in sample scripts again, we can come down and just do a Query a TSV file, just a very simple script. So I click it, it brings us in, you can see the job name. And you can also see what we call AUs. So that's the number of compute processes that can happen at the same time. Once you increase it, it increases performance. It also increases your cost. All right, so we see our query here. It's a very simple script, it's just a schema being declared here. We're reading from a file, we're using extractors, and then we're outputting it, right? So back to AUS again, if I move that up to, say 32 AUs, it's going to be faster than 1 AU.

From here, we can go ahead and click Submit. You'll see it submitted successfully, it's preparing right now.

[Video description begins] The Query a TSV file page is displayed on the screen. The middle pane contains tabs such as: Job graph, Script, Data, AU analysis, and Diagnostics. The left pane contains the Status information and various other elements. [Video description ends]

If this was a bigger query, we could have scaled up those units, those AUs, and we could have scaled it more containers to run. This is a very simple query so 1 AU was sufficient. As you can see now it's preparing, it's queued up, and now it's getting ready to run. The progress is done. We see the SearchLog.tsv. We see the Extract, it has 23 rows. And it shows you how it flows down into aggregation and then out to our output file, which we can open up and we can take a look at.

[Video description begins] In the flow chart, he clicks the SearchLog_output section. [Video description ends] 

So if we come back into our data lake again, and we just do a Data explorer, we can see we have our data lake here.

[Video description begins] In the top panel, he clicks the datalakeservicerks and then clicks the Data explorer option. [Video description ends]

We can drill down and we can see some of that sample data we put in. So it's a really quick way to get in and start using the Azure Data Lake Analytics. And I hope this video showed you how to create that Azure data lake, how to load some sample data in, and query that data, and view it as an output.

Course Summary

[Video description begins] Topic title: Course Summary. [Video description ends]

So in this course, we've examined the Azure Analytics Workloads. We did this by exploring Azure Synapse Analytics service, Synapse SQL pool best practices, Data Warehouse units, and comparing transactional and analytic workloads, and batch or real time data processing. We also created a Synapse SQL pool using Azure Portal and PowerShell, data warehousing workloads and when to use a data warehouse solution and using Azure Data Lakes Analytics.

In our next course, we'll discuss the components of the Azure modern data warehousing.

Comments

Popular posts from this blog

java chapter11 practice question on abstruct class and interfaces

java practice set 8

java exercise4