Azure Data Fundamentals: Azure Data Ingestion & Processing
Data ingestion and processing allow your data to be accessed, used, and analyzed. In this course, you'll learn about Azure data ingestion and processing, including loading strategies for Synapse SQL pools. You'll examine Azure Data Factory pipelines and activities, as well as how to create an Azure Data Factory. Next, you'll learn how to use the Data Factory Copy Data tool to copy data. Finally, you'll explore how to use SQL Server Integration Services, Azure Databricks, Azure Synapse Analytics, and Azure Data Lake to ingest data. This course is one in a series that prepares learners for the Microsoft Azure Data Fundamentals (DP-900) exam.
Table of Contents
- Course Overview
- Azure Data Loading for Synapse SQL Pools
- Azure Data Factory Pipelines and Activities
- Creating an Azure Data Factory UI
- Using the Azure Data Factory Copy Data Tool
- Using the Azure Synapse Analytics PolyBase
- Using SQL Server Integration Services
- Using Azure Databricks
- Loading Data into Azure Synapse Analytics
- Loading Data Using Azure Data Lake
- Course Summary
Course Overview
Hi, I'm Bill Brooks. I've held various positions in the software field for over 20 years, including computer engineering technology instructor, software developer, software architect, and team lead. [Video description begins] Your host for this session is Bill Brooks. He is a Senior Software Developer. [Video description ends]
I have extensive experience with the Microsoft development stack and as a cloud developer, primarily in the Microsoft Azure environment. I'm a strong proponent of the Agile approach to development and have been both a Product Owner and a Scrum Master. I hold a Bachelor's degree in Mathematics from Concordia University in Edmonton and a Computer Engineering Technology Diploma from NAIT.
Microsoft's DP-900: Microsoft Azure Data Fundamentals certification is targeted at audiences who have foundational knowledge of core data concepts and how they're implemented using Microsoft Azure data services. It's aimed at individuals beginning to work with data in the cloud; familiarity with the concepts of relational and non-relational data, and with different types of data workloads such as transactional or analytical, would be considered an asset for this certification.
Data ingestion and processing allow your data to be accessed, used, and analyzed. In this course, I'll cover Azure data ingestion and processing, including loading strategies for Synapse SQL pools. I'll introduce you to Azure Data Factory pipelines and activities and create an Azure Data Factory. Finally, I'll show you how to use the Copy Data tool to copy data, and how to use SQL Server Integration Services, Azure Databricks, Azure Synapse Analytics, and Azure Data Lake to ingest data.
Azure Data Loading for Synapse SQL Pools
Synapse Analytics is a powerful analytics tool in Microsoft's Azure cloud platform. In this video, we're going to look at the Synapse Analytics pipeline. We'll identify each step from consuming the data from a source to transforming the data for analytics purposes.
Synapse Analytics supports an Extract, Load, and Transform or ELT pipeline. Data that resides in some external data store, such as SQL Server or Cosmos DB, is extracted from the source, meaning a copy of the data is made that's consumable by Synapse Analytics. That extracted data is then loaded into a dedicated SQL Pool, which is Synapse Analytics' equivalent of an internal SQL database. Once the data is in a Synapse Analytics dedicated SQL Pool, queries can then be run on it, transforming the data for reports or dashboards, or to feed to other analysis tools.
So the first step of the Synapse Analytics pipeline actually happens external to the Synapse Analytics instance. Before the data can be ingested, it needs to be put in a format compatible with ingestion. Regardless of the data source, be it SQL Server, Cosmos DB, Azure Table Storage or any other data store, a service must translate the data into text files. Now, there are many formats compatible with Synapse Analytics, such as Hadoop file formats, or JSON.
But one of the most popular is Comma-Separated Values or CSV files. These files contain rows of data as text, each data value separated by a comma or some other delimiter. Once the data has been copied into a text file, such as a CSV file, it must then be stored in what is typically called a landing zone. That is a place where Synapse Analytics can ingest it. Synapse Analytics can ingest data from many data stores, including Azure Blob storage and Azure Data Lake Store. Prior to ingesting data into Synapse Analytics, you must define tables in Synapse Analytics to store the incoming data and the schema of these tables may be different than that in the source or landing zone data store.
Somewhere along the pipeline from the source to the Synapse Analytics tables, the data must be prepared, meaning it must be put in the form that best suits the analytics. For instance, the source data might be time series temperature data from a refrigerator unit, but for analysis, you only care about the hourly average temperatures. The data can be aggregated before being loaded into Synapse Analytics.
This can occur anywhere along the pipeline, from source data to loaded data. But it's best to perform this preparation earlier rather than later. For instance, it's much easier to aggregate data in a source SQL database instance than once it's stored as text data in the landing zone data store. The next step in the pipeline is the ingestion or loading the data into Synapse Analytics from the landing zone data store.
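As a rough sketch of that kind of early preparation, here's a hypothetical T-SQL query that could run against the source SQL database to reduce raw temperature readings to hourly averages before they're extracted to the landing zone. The table and column names here are invented purely for illustration:

```sql
-- Hypothetical source table: dbo.FridgeTemperatures(UnitId, ReadingTime, TemperatureC).
-- Reduce raw time-series readings to hourly averages before extraction.
SELECT
    UnitId,
    DATEADD(HOUR, DATEDIFF(HOUR, 0, ReadingTime), 0) AS ReadingHour, -- truncate timestamp to the hour
    AVG(TemperatureC) AS AvgTemperatureC
FROM dbo.FridgeTemperatures
GROUP BY UnitId, DATEADD(HOUR, DATEDIFF(HOUR, 0, ReadingTime), 0);
```

Running this in the source database means only the hourly averages ever land in the landing zone, rather than every raw reading.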
This is typically done with Azure Data Factory. And there are several options here, two of which are optimized. You can use PolyBase, which uses Transact-SQL to define not only the loading tables, but also some external tables, which are like views into the incoming data that help with preparation and efficiency. A newer feature is the COPY statement, available through Azure Data Factory, that's more flexible than PolyBase in many ways and may very well supersede PolyBase in the future as the standard method for moving data into Synapse Analytics.
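To give a feel for it, here's a minimal sketch of what a COPY statement might look like when loading delimited text files from a landing-zone container into a staging table in a dedicated SQL pool. The table name, account URL, and credential are placeholders, not values from this course's demos:

```sql
-- Minimal sketch of the Synapse COPY statement (placeholder names and secrets).
COPY INTO dbo.StageUsers (name, age)
FROM 'https://<storage-account>.blob.core.windows.net/<landing-zone-container>/'
WITH (
    FILE_TYPE = 'CSV',
    FIELDTERMINATOR = ',',
    FIRSTROW = 2,  -- skip the header row
    CREDENTIAL = (IDENTITY = 'Storage Account Key', SECRET = '<storage-account-key>')
);
```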
Once data is in the Synapse Analytics SQL Pool staging tables, queries run on the Synapse Analytics instance can then transform the data. This step typically transforms the data into a form that's meaningful for business analysis. And finally, this transformed data is moved to production tables to be consumed by downstream analysis, such as business reports.
This is done with an INSERT INTO ... SELECT statement, where the INSERT INTO clause determines where the data is going to be placed in production, and the SELECT clause transforms the data. In a standard use case, this might be the step that queries the data and produces the results for a quarterly sales report.
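As a hedged sketch, that transform-and-move step might look something like the following, where staged rows are aggregated into a production reporting table. The table and column names are made up for illustration:

```sql
-- Hypothetical transform step: aggregate staged rows into a production reporting table.
INSERT INTO dbo.QuarterlySalesReport (ProductId, SalesQuarter, TotalSales)
SELECT
    ProductId,
    DATEPART(QUARTER, SaleDate) AS SalesQuarter,
    SUM(SaleAmount)             AS TotalSales
FROM dbo.StageSales
GROUP BY ProductId, DATEPART(QUARTER, SaleDate);
```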
Azure Data Factory Pipelines and Activities
Azure Data Factory can be used as an integral part of any data pipeline strategy in Azure, since it runs activities that move data from source to destination, as well as activities for preparing that data. In this video, we're going to review how Azure Data Factory works.
First, Data Factory has the concept of a pipeline. A pipeline is a logical grouping of activities that execute tasks such as moving data or transforming data. These activities are connected, so the output of one activity serves as the input of the next, supporting the flow of data. The activities in the pipeline are grouped because they serve one logical function.
For example, a pipeline might contain activities for moving landed data from Azure Blob storage to staging tables in Synapse Analytics, while also aggregating the data as part of the larger pipeline for serving aggregated sales reports. So let's take a look at the fundamentals of how Azure Data Factory works. At the heart of Data Factory is pipeline management, which logically groups activities that serve a single purpose into pipelines.
You can think of activities, or the steps of a pipeline, as being similar to the steps of a recipe. Data Factory supports three main types of activities. First, there are activities that move data; these can both consume data from a wide variety of data stores and write data to a variety of stores. Then there are data transformation activities, which change the data somehow. For instance, there's a Databricks activity that allows Python code to run, which can transform the data.
Perhaps a pipeline that's designed to obfuscate keywords in text messages could consume messages from a source data store, run a Databricks activity to recognize and obfuscate the keywords, and then write the modified messages to a destination data store. This would be three activities representing a single pipeline. And finally, there are control activities which allow for logical controls, such as filtering and looping. For example, the text message pipeline just described could consume several text messages at once, and a control activity could loop to process each message.
As briefly mentioned, Data Factory supports a number of data stores for both source and destination. For example, one common combination is to consume data from Azure Blob Storage, which is a popular landing zone data store, and write the data to Synapse Analytics, a popular data consumer for analytics processing. Data Factory contains a copy tool which lets you easily configure a pipeline between a source data store such as Azure Blob Storage and a destination data store such as Synapse Analytics.
There are many different data transformation activities available in Data Factory, which allow you a great amount of flexibility. Some of these are Data Flow, which allows you to graphically design a data pipeline; Hive, which allows you to run Hive queries on HDInsight clusters; the stored procedure activity, to run a stored procedure in a database such as SQL Server; a U-SQL activity for Data Lake Analytics; and the Databricks activity, with which you can run Databricks notebooks, JAR files, and Python code.
Azure Data Factory offers visual wizards for setting up pipelines and associated activities. Both pipelines and activities have a standard JSON format. The pipeline JSON format includes a name and description of a pipeline, an array of activity objects, and other details, such as parameters and the maximum allowed concurrency in the pipeline. The objects in the activities array of the pipeline JSON themselves have activity details like name and description, the activity type, the properties that go with the type and the name of the linked service describing the external data source.
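As a rough illustration of that structure, a minimal pipeline definition might look like the following skeleton. The names and the single Copy activity are invented for this example, and a real definition generated by the Data Factory UI would carry additional properties:

```json
{
  "name": "CopyDemoPipeline",
  "properties": {
    "description": "Hypothetical pipeline with one Copy activity that copies a file between blob containers",
    "activities": [
      {
        "name": "CopySourceToDestination",
        "description": "Copies the staged file from the source container to the destination container",
        "type": "Copy",
        "inputs": [ { "referenceName": "SourceDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "DestinationDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "BinarySource" },
          "sink": { "type": "BinarySink" }
        }
      }
    ],
    "parameters": {},
    "concurrency": 1
  }
}
```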
In the description of the pipeline JSON, I briefly mentioned a concurrency setting. In a pipeline, activities may run concurrently or in parallel, which speeds up runtime and makes the pipeline more efficient. What can be run concurrently will depend on the dependencies of the various activities in the pipeline. You should keep concurrency in mind when designing your data pipelines. And finally, once your pipelines are defined, they can be scheduled.
Perhaps you have a pipeline for quarterly reports, and to avoid processing a large amount of data once a quarter, you could schedule the data ingestion to take place once a week. Schedules are triggered for pipelines in different manners. One common trigger is a calendar-style trigger. For example, you may want to ingest quarterly sales data into production every Friday at 5 p.m. You can also trigger pipelines to run on-demand by clicking a button.
Creating an Azure Data Factory UI
Azure Data Factory is a great service for moving data around. In this video, I'm going to demonstrate how to set up Azure Data Factory, and I'll show you how to set up a data pipeline in that instance to copy data from an input blob storage container to an output blob storage container. This will show you the basics of how to set up a pipeline, and this knowledge can be extended to move or transform data between other types of sources and destinations.
Now, note that I'm going to be performing this demonstration as a subscription administrator. In order to create an Azure Data Factory instance, you need to be either a subscription administrator or have the Contributor or Owner role in the subscription. Further, to create subresources in the Azure Data Factory, such as the pipeline components, you must have the Data Factory Contributor role.
As an administrator, I automatically inherit that role. So the first step I'm going to do to create this from scratch is to create a resource group. So in Azure portal, I'll go to my Resource groups and I'll click "Add". Now, I'll just choose a subscription I have, which is Free Trial, but if you're doing this, you can choose any subscription. And I'm just going to call this DemoDataFactory as a resource group name, and I'll just default the region. Click "Review + create" and then click "Create". Now that I have my resource group, I'll click on my resource group, and under Overview, I'll click "Add" at the top.
Now, what I want to do is I want to create a blob storage and this is going to be both my source and my destination for my pipeline. So I'll just type in storage and choose Storage account and then click "Create". [Video description begins] The screen reads Create storage account. It features five steps of this process that read: Basics, Networking, Data protection, Advanced, Tags, and Review + create. The Basics step contains a template to fill in project and instance details. [Video description ends]
And under the Basics for Create storage account, it's already chosen my subscription and it's already chosen my resource group. So I need to give it a Storage account name. So I'll just call it BlobStorage, and that's probably not going to work. Yeah.
Because it needs only lower case, so I'll just call it blobstorage. And that's already taken, not surprisingly, so I'm just going to add some numbers on the end to make it unique, blobstorage266. I'm just going to default the location. Default everything else, Performance, Account kind. It's a general purpose. I'll click "Review + create" and hit "Create". [Video description begins] The host skips the other steps and clicks the Review + create button at the bottom left corner of the screen. [Video description ends]
All right. So that's created. So now I'm going to Go to resource. And I'm actually going to back up and go to my Resource groups and click DemoDataFactory and there I see my blob storage. Now the next thing I'm going to do is I'm going to open up my blob storage and I'm going to click on Containers in the Overview and I'm going to create some blob containers.
So at the top, I'm going to click "Container", and I'm going to create a new container [Video description begins] The host clicks the Container button, this opens an overlapping panel on the right side of the screen that reads: New container. It contains a field to specify Name and Public access level, which is set to Private (no anonymous access). Below, there is a Create button. [Video description ends] and I'm going to call it source. This is going to be the source container for my pipeline. I hit "Create". And I'm going to create another one by clicking "+ Container" at the top and I'm going to call this destination for my pipeline.
Now, what we need is something to move between the source and the destination. So I'm going to upload a JSON file, which is going to be what we're going to move with our pipeline. So if I click on the source container, and at the top, I click "Upload", [Video description begins] The host clicks the button Upload, which opens an overlapping panel on the right side of the screen that reads: Upload blob. It contains a File button to open the windows explorer window, [Video description ends] and I'm going to select a file that I have called demo.json, and I'll click "Upload". Okay. So that is now uploaded into my source container.
Now, I'm going to go back to my Resource groups. [Video description begins] The host clicks on Resource groups up in the breadcrumb trail. [Video description ends] I'm going to go to DemoDataFactory. And I'm going to click "Add" at the top and what we're going to do now is add our data factory. So I'm going to type in data factory. And choose Data Factory, click "Create". [Video description begins] The screen reads Create Data Factory it features five steps of a process that read: Basics, Git configuration, Networking, Advanced, Tags, and Review + create. The Basics step contains a template to fill in project and instance details. [Video description ends] Now, under the Basics of Data Factory, I'm just going to default things that already chose my resource group because it's the only one I have, Subscription, I just need to give it a name.
So I'm just going to call it DataFactory. That probably won't work because, yeah, that already exists, so I'll add my dummy number at the end, 266 that makes it valid. I'm just going to default the version, hit "Review + create" and there's one step I have to do for Git configuration before it'll allow me to create it.
If I go back to Git configuration at the top, I have to click "Configure Git later", because I don't have a Git repository to connect to Data Factory. So I need to tell it that I don't want to connect that right now. Click "Review + create" and click "Create". So take a few seconds, but it'll create my Data Factory.
There we go. Now, if I go back to my resource group, I now see, or it might take a few seconds. Okay. It took a few seconds, but my data factory has shown up in the resource group. So I have my blob storage, remember, with my source container and destination container, and we have a JSON file in the source container. And now I'm going to use the data factory to create a pipeline that moves that source JSON to the destination.
So I'll click on my data factory, and under Data Factory, on Overview I want to click "Author & Monitor", and this will take me to the UI that allows me to set up my pipeline. Now, in this demonstration, I'm going to set up my pipeline manually. So on the left, there are some options. There's one called Manage. If I click on that, there is a section that says, Linked services.
I want to click "New" and create a New linked service. [Video description begins] The host clicks the New button which opens an overlapping panel on the right of the screen. It reads: New linked service. Below the tab data store is selected. [Video description ends] Now, what a linked service does is it connects to your data source. So I have options for Data store, I'm going to choose Azure Blob Storage because that's where my data is.
Click "Continue". [Video description begins] The host clicks the Continue button in the bottom left of the New linked service panel. Now the panel features a template. [Video description ends] And I'm just going to let it to the default name AzureBlobStorage1. That's fine. And I have to set up the Azure subscription here, so I'm going to select my Free Trial under Account selection method and Storage account name is my blobstorage.
So this will connect it to my blob storage. At the bottom, there's a button that says Test connection. So if I click that, I can see whether it works. Connection successful. So now when I click "Create", my Data Factory has a linked service that connects it to my blob storage, so it knows how to talk to the blob storage I want to interact with. I have to create a linked service for every data store I want to interact with.
Now the next step is to create datasets. So on the left, if I go up to Author, which is a little pencil, and click that, there's Factory Resources, and this is where you create the parts of your pipeline. So I click the little plus, which says Add new resource, and I choose Dataset. [Video description begins] The host clicks on the option Dataset, which opens a new panel to the right that reads: New dataset. There are various tabs to choose from: All, Azure, Database, File, Generic protocol, NoSQL, and Services and apps. [Video description ends]
I'm going to create my source and my destination datasets in the Data Factory. So I'm going to choose Azure Blob Storage once again and click "Continue". [Video description begins] The host clicks the button Continue in the bottom left corner of the panel and now the panel reads: Select format. There are various formats to choose from. [Video description ends] And now it asks me for the format of my data.
Now, if I was to transform the data, I would want to give it the format JSON because perhaps I'd want to change the JSON before I put it into the destination. But in this case, we're just copying so I can choose Binary. In other words, I don't really care what the format is, just treat it like a binary file. Click "Continue". [Video description begins] The host clicks the button Continue in the bottom left corner of the panel and now the panel reads: Set properties. Below, there are two fields to specify Name and Linked service. [Video description ends] I have to choose the Linked service, of course, to tell it where it is and I have to give it a path and there's a little folder beside the path so I can choose.
I can see the contents of my Blob Storage, and I can choose source, double click on that, and choose my demo.json. That is the dataset that I want to link to. Hit "OK", and hit "OK". And now, to the left under Factory Resources, under Datasets, I have a dataset. It's called Binary1 by default. So I'm just going to change that to Source. That's my Source dataset. Now we have to create another one for my destination to tell it basically where to put the data. So I'll hit the plus under Factory Resources again, click "Dataset".
Again, choose Azure Blob Storage, because that's where my destination is. Again, I'll choose Binary and click "Continue". I will choose the same Linked service because in this case it's going to be in the same blob storage. And I will click the little folder and I will choose destination and click "OK" and click "OK" again. That will create my second dataset, which I will call Destination. So now we have our datasets set up.
The final bit is to actually set up our pipeline. So under Factory Resources, I'm going to click plus again and click "Pipeline". Now, a pipeline gives you a whole bunch of different functions you can apply to data, so you can do more than just move data: you can transform data, and you can perform functions and all sorts of analysis on it.
I'm going to open up Move & transform and click Copy data. Actually, I'm not going to click it. I'm going to drag it. [Video description begins] The host drags and drops the Copy data option to the center of the screen in the empty space. Below a series of tabs appears. They read: General, Source, Sink, Mapping, Settings, and more. [Video description ends] And now I have a Copy data method, which I can give details to.
Down at the bottom, it says Source and Sink, and you see there's a little red 1, that means I have to set some things up. So if I click Source, I choose the Source dataset. In this case, it's the dataset I called Source. Now, I'm not going to set anything else. There's other options I can set for this source, but I'm just going to leave it at that. And under Sink, I can choose the Sink dataset, which is Destination.
And now I am going to publish this pipeline. So at the very top of the UI, there's a "Publish all" button and it has a little 3, which means there are three things I've set up but not yet published: two datasets and a pipeline. So I'll click that, and click "Publish" at the bottom under Publish all. [Video description begins] The host clicks the Publish all button, which opens an overlapping panel to the right of the screen. It reads: Publish all. There is a list with all Pending changes, and below, there is the Publish button. [Video description ends] So that's now deploying my changes and publishing that pipeline. Okay. So publish successful.
So now that I've created my pipeline, I can run it. Now you can click Add trigger at the top and you can add a trigger that triggers this on a schedule. I'm just going to run it right now. So if I click add trigger, I can click Trigger now. And under Pipeline run, click "OK". And now it says Succeeded. So now if I go to Monitor on the left-hand side, and if I go to Triggered under Pipeline runs, I see that I have one run that's successful.
Now, if I go back to Azure and take a look, I should be able to see that my JSON file has been copied. So if I go to my Resource groups, DemoDataFactory, blob storage, and I go to Containers, now, if I look at my source, I should see my JSON file. That's where we put it. And if I look at my destination, there's the JSON file, it's been copied. So in conclusion, I've shown you how to set up an Azure Data Factory and create a pipeline for moving data between a source and a destination. This functionality is often used to move data in a data analytics solution.
Using the Azure Data Factory Copy Data Tool
In this video, I'm going to show you the easy way to set up Azure Data Factory to copy data from a source to a destination using the built-in copy data tool. So in the portal here, I'm just going to go to my Resource groups, and I have a resource group set up called DemoDataFactory, and you'll see that I have some resources set up. I have a storage account called blobstorage and a datafactory.
Now, I've set up my Blob Storage for this. So let me show you what I have. So I open up my blob storage, I'll go to Containers. [Video description begins] The host clicks on Containers in the Overview blade of the blob storage account. [Video description ends] And under containers, I have two containers, a source and a destination.
So what we're going to do today is use Data Factory to copy a JSON file from the source to the destination. Now, if I open up my source container, here's my demo.json, and if I go back to my destination container, you'll see that it's empty. So when we're done, there should be a JSON file here, which was the copied JSON file.
So now if I go back to my resource group and I click on my datafactory, in the Overview of my datafactory, I should be able to click Author & Monitor to go to the UI. Now, here I am in the UI and on the Let's get started page, there are a lot of different tools here that help you through some standard activities. The one we're going to use today is Copy data.
So I'm going to click on that and it opens up a wizard. That's really nice because it steps me through the whole process of copying data. [Video description begins] The steps of the wizard read: Properties, Source (which is divided into two parts namely Connection and Dataset), Destination (which is also divided into two parts namely Connection and Dataset), Settings, Summary, and Deployment. [Video description ends] Now, under the main Properties, I'm just going to let the Task name be the default. I'm going to leave it on Run once now, so it's only going to run this copy activity once as opposed to scheduling it. Click "Next". Now I have to set up a source data store.
Now, this is the connection to my Blob Storage. I'll click "Create new connection" [Video description begins] The host clicks on Create new connection. This button opens an overlapping panel on the right side of the screen that reads: New linked service. There are various services to choose from. Below there is the button Continue. [Video description ends] and I'll choose Azure Blob Storage and hit "Continue". [Video description begins] The New linked service panel now features a template. [Video description ends] Now I have to give it the account information and my storage account information. So, Azure subscription, Free Trial, Storage account name, blobstorage. Now I'll click Test connection at the bottom, and it'll tell me whether it connected. Connection successful.
So I'll click "Create". [Video description begins] The host clicks Create in the bottom left corner of the New linked service panel, and the panel disappears. He is now back in the second step of the wizard, which is Source. [Video description ends] Now, that's created my new connection, called AzureBlobStorage1. Now I'll click "Next". Now I choose the input file or folder. [Video description begins] The host moves on to the second part of the Source step, which is Dataset. [Video description ends]
There's a nice "Browse" button here that I'll click. And I can choose my source folder, double click that. And I will choose my JSON file, that's what I want to copy, so I'll click "Choose". Now I'm going to check "Binary copy" because I want to treat it like a binary file. During copy, I don't want to have to look at the properties of the JSON file to decide what to copy.
I'm just going to copy the whole file. So I choose Binary, click "Next". [Video description begins] The host moves on to the third step: Destination. [Video description ends] Now, my destination data store could be different than the source, but it's not, it's the same blob storage. So I'll choose that and click "Next". And I choose the output file or folder. I'll click the "Browse" button again, and this time I choose my destination folder as the output folder, click "Choose" and click "Next".
Now there are some additional settings, but I'm just going to click "Next" on those and leave them as default. And here's the Summary, Azure Blob Storage to Azure Blob Storage. [Video description begins] The host skipped the fourth step of the wizard, Settings, and is now in the fifth step: Summary. [Video description ends] And it shows me details about my Properties, my Source and Destination. And I have some Copy settings, so I'll click "Next". And now it's validating the datasets, creating the pipeline and running the pipeline, [Video description begins] The host is in the final step: Deployment. [Video description ends] and it's all completed. So that's all there is to it.
It's just copied all the data. Now I'll click "Finish". So that's the nice Copy data tool and let me show you what it actually did. You could actually set this all up manually, but the copy data tool is a lot easier. So on the left, if I click on Author, which is a little pencil, I've created a new pipeline and also two datasets. And that was all part of the wizard.
Also on the left, if I choose Manage, there is what's called a linked service, and that is a connection to my Blob Storage, which we initially set up, so all that was set up for me. Now, let's go back to the portal and I'll just go to my resource group. And let's take a look at what happens.
So, if I open up my blobstorage and I go to Containers now, first, I'll go to my source, remember, my source had the demo.json and I'm just going to open it up to show you what's in it so that we can verify the contents were copied correctly. So if I open up the demo.json and I click "Edit", it has a "Name":"John Doe", and an "Age":25. It's just a simple JSON object.
Now, let's take a look at the destination container. Now, remember, this was empty before and now it has demo.json. So that's been copied. Now, if I click on that and I choose "Edit", I see that the contents are indeed the same as in the source "Name":"John Doe" and "Age":25. So in conclusion, the Azure Data Factory UI contains a copy data tool that streamlines the setup of a linked connection, datasets and a pipeline for copying data from a source to a destination.
Using the Azure Synapse Analytics PolyBase
In this video, I'm going to demonstrate how to use Azure Data Factory to move data from Blob Storage to a Synapse Analytics instance using the PolyBase option. Now, PolyBase is a Transact-SQL feature that extracts data from external data sources, and it's very efficient at combining data from those external sources. So I'm starting off here in Azure Portal and I'm just going to create a resource group, so I click on "Resource groups" and I click "Add".
And I'm going to create a Resource group called DemoPolybase, "Review + create" and click "Create". Now, if I go to my resource group, so the first thing we need is a source, which is going to be our Blob Storage, so I'm going to click on "Add" and type in storage.
Choose a Storage account and click "Create". So it's already got my Subscription and Resource group chosen, I'm going to choose this as my Storage account name, demostorage, and I just add 266 to make it unique. And I'll let everything else default, click "Review + create" and "Create". So that'll just take a few seconds to create the storage.
Okay. So that's finished. So now if I go back to my resource group, DemoPolybase, here it is, and I'm just going to click on it. And what I'm going to do now is set up some blob containers, so I'm going to go to "Containers". I'm actually going to create one blob container called source, and that's going to be our source JSON that we're going to copy into Synapse Analytics.
So I'll open up that source container. At the top, I'll click "Upload" and I'll choose to Select a file, and I have one called demo.json. So I open that up and click "Upload". Now, I'll just show you quickly what demo.json has in it. If I click it and click "Edit", it has a simple JSON object: it has a "Name":"John Doe" and an "Age": 25.
So now I'm going to go back to my resource group and click "Add". The next thing we're going to create is our Synapse Analytics instance. So I'll type synapse and choose Azure Synapse Analytics and click "Create". [Video description begins] The screen reads: Create Synapse workspace. It features a five-tab wizard. The tabs read: Basics, Security, Networking, Tags, Summary. [Video description ends]
Okay. So it's already set up to my Free Trial Subscription and I can choose my Resource group DemoPolybase and I'm going to type in a Workspace name, so this is just going to be synapse266 just to make it unique. That's great, and I have to give it an Account name and a File system name. Now I'm going to create new ones. These are Data Lake Storage accounts that it needs, so I'll just call this account266, just using the 266 to be unique, and file266. Okay.
And I'll also check the option to Assign myself the Storage Blob Data Contributor role. Now, I'm going to go to the Security tab, and I want to give the sqladminuser on my Synapse instance a unique password that I know. You can always reset this password after you've created the instance, but I just like to do it right here. Now I'll hit "Review + create", and then I'll hit "Create". Now that'll take a few seconds to create my Synapse instance. Okay. So that's complete. So if I go to my resource group, I now have the synapse266 workspace.
And the final thing I'm going to create is the actual Azure Data Factory. It's going to tie these together. So I click "Add". And I'll type in data factory, [Video description begins] The host clicks the Add button in the Overview blade of the DemoPolybase resource group. [Video description ends] choose the Data Factory, click "Create". [Video description begins] The screen reads: Create Data Factory. It features five steps of a wizard that read: Basics, Git configuration, Networking, Advanced, Tags, and Review + create. The Basics step contains a template to fill in project and instance details. [Video description ends]
And it fills in my Resource group. I'm going to give it a Name of datafactory266, and I'm going to also go to "Git configuration" at the top and I'm going to click "Configure Git later" because I don't need a Git Repository. So I'll click "Review + create", and click "Create". And we'll just wait for that to finish. And now that's done, so if I go to my resource group, I now have a Data Factory. So now what I'm going to do is I'm going to have to create a Destination Database in my synapse.
So if I open up synapse, on the Overview, you'll find under Getting started, Open Synapse Studio. So click "Open". And this will take you to the Azure Synapse Analytics UI. Now in here, I can create databases. So if I go to "Manage", [Video description begins] The host clicks the suitcase button in the left-hand navigation bar. [Video description ends] and a database, it's actually called a SQL pool, if I click "New", [Video description begins] The host clicks the button New under the SQL pool blade. This opens an overlapping panel that reads: Create dedicated SQL pool. It features four wizard tabs that read: Basics, Additional settings, Tags, Review + create. Below in the bottom left corner is the button Review + create. [Video description ends] I'm going to create a dedicated SQL pool.
And I'm just going to call it demo. "Review + create" and I'll click "Create". Okay. So the demo database is created. So now on the left, if I go to "Data", [Video description begins] The host clicks the Data button in the left-hand navigation bar, which is a barrel symbol. [Video description ends] you'll see under "Databases", I'll just click "Refresh", and there's now a database called demo.
And if you take a look at the folders here, I have Tables, which you'd expect, and I also have something called External tables and External resources. Now, External tables are used by PolyBase. They're basically table definitions over data that still lives in external storage, like views into the incoming data, which make it easier to query and join that data as it's coming into the database.
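To make that concrete, here's a minimal, hypothetical sketch of the kind of T-SQL PolyBase relies on: an external data source pointing at the landing zone, a file format, and an external table over delimited text files that the load can then read from. All names, paths, and the credential are placeholders, and in this demo Data Factory generates and cleans up the equivalent objects for us rather than us writing them by hand:

```sql
-- Hypothetical PolyBase objects over delimited files in a landing-zone container.
CREATE EXTERNAL DATA SOURCE LandingZone
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://<container>@<storage-account>.blob.core.windows.net',
    CREDENTIAL = LandingZoneCredential  -- a database-scoped credential created beforehand
);

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2)
);

CREATE EXTERNAL TABLE dbo.ExternalUsers (
    name NVARCHAR(255) NOT NULL,
    age  INT NOT NULL
)
WITH (
    LOCATION = '/users/',        -- folder within the data source
    DATA_SOURCE = LandingZone,
    FILE_FORMAT = CsvFormat
);

-- Loading the internal table is then just a query over the external table.
INSERT INTO dbo.Users (name, age)
SELECT name, age FROM dbo.ExternalUsers;
```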
Now, what we need to do is we need to create a table to hold the data that we're going to bring in. So I'm going to click "New table". [Video description begins] The host clicks on the three vertical dots next to Tables, which open a submenu with two options: New SQL script, and Refresh. The host goes on the first option that reads: New SQL script. This opens a further option which is: New table. [Video description ends]
And this creates a query for me and I'm going to call my table Users because my JSON represents a user, and Users is going to have a name field that can be a string, [Video description begins] In line 3 the host types: name nvarchar (255) NOT NULL. [Video description ends] and also an age field, which is an integer. So I'll click "Run" on that. [Video description begins] In line 4 the host types: age int not null. [Video description ends]
I also have to adjust the column name in the HASH distribution because there's no col1 column; we'll use name as our hash column. I click "Run", and it executed successfully. So now if I refresh the Tables, I have a Users table with columns name and age.
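Put together, the script I ran looks roughly like this; treat it as a sketch rather than the exact generated script, with the distribution and index options coming from the New table template:

```sql
-- Destination table for the incoming JSON records, hash-distributed on name.
CREATE TABLE dbo.Users
(
    name NVARCHAR(255) NOT NULL,
    age  INT NOT NULL
)
WITH
(
    DISTRIBUTION = HASH (name),
    CLUSTERED COLUMNSTORE INDEX
);
```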
So that's going to be my destination in the demo table. So if I go back to the portal for a second and I'm going to go to my Resource group, and I'm going to go to my datafactory and now we're going to set up the actual pipeline. So if I click on "Author & Monitor" under Overview, it opens up the Azure Data Factory UI.
And I'm going to click "Copy data" and we're going to step through and see how to set this up with PolyBase. So first of all, under the first Properties of the Copy Data tool, I'll just leave the Task name default and we're going to run it once, so I'll leave Run once now, click "Next". Now, I need to create a Source data store. So I hit "Create new connection".
Now, my Source data store is Blob Storage, Azure Blob Storage and I'll click "Continue" and I need to give it some information here under Account. My Azure subscription, and the Storage account name, which is demostorage. And I'm just going to click "Test connection" at the bottom. And that's successful, so I'll click "Create".
So that's my Blob Storage. Now I'm going to click "Next". Now, I choose a file or folder, so I am going to actually choose the json file. And click "Next". Now for File format settings, it already knows it's a JSON format. I'm going to leave it at JSON format and you notice at the bottom it gives me a Preview. And if I click "Schema", I can see it's picked up the Schema of that JSON format, the Name is a string, Age is an integer.
So I'll click "Next". Now it needs a Destination data store. So that's my Synapse Analytics. So I need to Create a new connection and I'll type in synapse and I get Azure Synapse Analytics, click that and "Continue". Now I need to give it my Azure subscription. I need to choose the Server, synapse266, and I should see my Database demo.
Here we go. And SQL authentication, so the User name is sqladminuser. That's the default admin user and my Password is just the password that I gave my Synapse instance when I created it. Test the connection. That's successful, so I hit "Create". Okay. So I have that chosen and I'll click "Next". Now I want to use existing table.
And I want to choose dbo.Users, so this is a Table mapping, it's asking me when I get data from Blob Storage what's the Destination table. That's why we had to create the table in Synapse Analytics. So it's Users. I'll click "Next". So you'll see these field names are the JSON field names, and if I scroll to the right, I can choose Column names in the Destination table.
So I'm going to map name to name and age to age. And click "Next". And you notice on Advanced settings, the Copy method is PolyBase, and you have to have Enable staging checked and we have to choose a Staging account. It uses a Staging account to store the data temporarily while it loads it with PolyBase.
So I'm just going to choose the AzureBlobStorage and I'll choose the same source container, and it will just use that as temporary storage. Click "Next". So you'll see that in my Summary, I'm going from Azure Blob Storage, through Azure Blob Storage as a staging area, and then it's going to use PolyBase to bring it into Azure Synapse Analytics.
So if I click "Next", it's going to do all of this. It's Creating the datasets, Creating the pipeline and Running the pipeline, and it says it was successful, so I'll click "Finish". So now what I'm going to do is go back to the portal. If I go to the resource group and we go to our synapse and under Overview, I click "Open Synapse Studio".
We'll just take a look at the database and see if our data has been copied. So on the left I'll click "Data", and I'll open up the database demo and open up the Tables. Now, first thing I want to show you is if I open up External tables, there's nothing there. So as I mentioned, PolyBase creates External tables, but Data Factory automatically cleans those up afterwards for you.
So if I click on Users and I do a New SQL script, I can select the TOP 100 rows. And you see, the John Doe is in my table, Age 25, Name John Doe. So just to summarize, what we've done here is we've used Azure Data Factory with the PolyBase option, in order to copy data using temporary External tables from Azure Blob Storage to Synapse Analytics.
Using SQL Server Integration Services
In this video, I'm going to show you how to run SSIS packages in Azure Data Factory. Now SSIS stands for SQL Server Integration Services, and it's a SQL Server solution for data workflows. Now, I'm not going to go over the details of SSIS in this video, but I will mention that SSIS packages can be created in Visual Studio, which has a visual drag-and-drop interface to create a workflow and allows you to write code for the steps of that workflow.
The resulting SSIS package can then be imported into a SQL Server instance, including an Azure SQL database. Today, I'm going to show you how you can invoke an instance of the SSIS runtime environment from within Azure Data Factory and then create a pipeline that can run an SSIS package directly from an Azure SQL database.
So I'm in the portal here and I'm going to create a new resource group for us to work in. So I'll click "Add". And I'm just going to call this SSISDemo. Click "Review + create" and click "Create". Now, in my resource group, the first thing I'm going to do is create a SQL database, so click "Add". And I'll type in sql database. So I'll create a SQL Database instance, click "Create".
And I'll just fill in the Basics here. I'll just call it database and I'll put some numbers on the end to make it unique. I need to create a new SQL Server instance. Okay. So I'm going to give it a simple server name, I'll just call it sqlserver and give it some numbers to make it unique. And you have to put in an administrator login, so I always use my own, bbrooks26, and I have to set a password for that server admin, and click "OK".
And then I'm just going to accept the defaults here and click "Review + create", and click "Create". So that'll just take a few minutes. Okay. So that's complete. So, now if I go back to the resource group. I now have a SQL Server instance and my SQL Database instance. Now, if you're going to use Azure Data Factory to run an SSIS package, then your SSIS package would have to be imported into this database.
Now I'm not going to do that today, but I'm going to show you the steps for connecting this to Azure Data Factory. So I need to create an Azure Data Factory, so I'll click "Add". And choose Data Factory and hit "Create". And I'll just give it a simple name, datafactory266, "Review + create". So I need to go to Git configuration and choose Configure Git later, because I don't want a git repository right now.
I'll click "Create". And that'll create our Data Factory. Now, that'll take a few seconds. So we'll just wait for that. Okay. And that's complete. So let's go back to our resource group. Now my Data Factory has shown up here. So what I'm going to do is click on data factory. And click on "Author & Monitor".
This will open up the Data Factory UI. So here we are in Azure Data Factory. So I'm going to create the SSIS runtime environment. If I just go to Manage on the left-hand side, [Video description begins] The host clicks the suitcase button in the left navigation bar. [Video description ends] and then on the left under Connections click integration runtime, and I'm going to create an integration runtime, so I'll click "New" and you see it has the option of Azure-SSIS. [Video description begins] The host clicks on New in the Integration runtimes blade. This button opens an overlapping panel on the right side of the screen that reads: Integration runtime setup. There are two options: Azure, Self-Hosted, and Azure-SSIS. Below there is the button Continue. [Video description ends]
Lift-and-shift existing SSIS packages. So that's what I want to do. I'll click "Continue" and I get the standard settings; I'm just going to allow the defaults. Click "Continue". And now I have to choose the database that's going to have my SSIS packages in it. It's already chosen for me because it's the only one.
The location is East US, and the server is sqlserver266.database.windows.net. I just have to give it the admin username and password, which I set up, so that was my bbrooks26. And I can choose a service tier; I'm just going to allow it to default. Click "Continue", and there are some other Advanced settings. I'm just going to click "Continue" and "Create". And this creates my SSIS integration runtime.
Now, I'm going to pause here because it takes a while to start this. It could take up to 20 or 30 minutes. Okay. Luckily, that didn't take 20 minutes. It's started, as you see, it's Running. So what I can do now is I can create a pipeline and I can use this integration runtime to run SSIS packages.
So if I click on "Author" on the left, the pencil and under Factory Resources, I'm just going to click the ellipsis beside Pipelines and choose "New pipeline". And then under General under Activities, there's an option to Execute SSIS package. If I drag that into the workflow, [Video description begins] The host drags and drops the option Execute SSIS package in the workflow. Below, a series of tabs appears. They read: General, Settings, SSIS parameters, and Connection. [Video description ends] now I can execute an SSIS package from my SQL Server.
So if I click on "Settings" below, it's going to ask me to choose my runtime. So I'm going to choose my integrationRuntime1 that I just created. And I'll show you the options here. You have Package location. So the database that it's connected to or you can choose a Package store, Embedded package, so you can actually embed the package right in here. In fact, you can, yeah, you can do an upload.
So this actually uploads a package, so you can upload a package from your local machine, and you can also choose a File system option, which will use a file share. So if you have a file share set up, you can access the SSIS packages from it. So these are all your options to run SSIS packages in Data Factory, and once you have this set up, if I click "Add trigger", I could run Trigger now, or you can set up a new schedule and schedule this. So in conclusion, I've shown you how to set up Azure Data Factory to consume an SSIS package from an Azure SQL Database instance.
Using Azure Databricks
In this video, I'm going to show you how to run Azure Databricks in Azure Data Factory. Now Azure Databricks allows you to execute scripts, such as Python, as part of your data pipeline. So I'm starting fresh here in the portal. I'm going to go to my Resource groups and I'm going to create a brand new resource group for us to use.
And I'm going to call it DatabricksDemo. Click "Review + create" and click "Create". So going into my resource group now, DatabricksDemo, the resources we need to do this are a data factory and an Azure Databricks instance. So I'm going to click "Add", and the first thing I'll add is the Data Factory. So I type in data factory and I choose Data Factory, click "Create".
And I'm just going to give it a name and I'm just going to keep it simple, datafactory and I add 266 to make it unique. It's just a random number, "Review + create". And I also have to go to Git configuration and click Configure Git later, because I don't want a git repository right now. Click "Review + create" and hit "Create". So that'll just take a few seconds to deploy. Okay. So that's deployed. So now if I go to my resource group again, DatabricksDemo and click "Add" one more time now I'm looking to create a Databricks instance.
So Azure Databricks, click "Create". And I need to give it a name, and again, I'm going to keep it simple, databricks, and I just add a number at the end to make it unique. Click "Review + create" and click "Create" and wait for that to finish. Okay. So that's complete.
So now if I go back to my resource group again, and I'm going to go into databricks and what we need to do is we need to set up a Databricks notebook. So I'm going to click on the Overview, "Launch Workspace". [Video description begins] The host clicks the Launch Workspace button in the Overview blade of Databricks266. [Video description ends] This opens up the workspace for Databricks.
Okay. So here in Databricks under Common Tasks, I can click "New Notebook" [Video description begins] The host clicks on New Notebook. A dialog window appears that reads: Create Notebook. It contains fields to specify Name, Default Language which is Python, and Cluster. Below, there is the button Create. [Video description ends] and I can just give it a name.
This is just going to be a demo_notebook. Hit "Create". And I'm just going to put something very simple in it, just so we have a notebook and I can show you what it looks like in Data Factory. So I'm just going to go print "hello world". [Video description begins] The host types print "hello world" in the demo_notebook. [Video description ends] This is a Python notebook.
And now I'm going to exit. So you see in my Workspace, if I take a look at my Workspace, I see my demo_notebook under my user. So if I just go back to the portal, what I'm going to do is go back to my resource group and now I'm going into data factory and I'm going to show you how you tie in Databricks in Data Factory. So I'll click "Author & Monitor" on the Overview. That takes me to the UI. Now, what I want to do is I want to create a pipeline and use my Azure Databricks.
So if I go to Author on the left-hand side, it's a big pencil and under Factory Resources, I hit the plus and I choose Pipeline and it'll create a new pipeline. And what I can do is open up Databricks. It's an option under Activities in Pipeline. And so you see there's a few options. And I'll step you through some of the options. You can run a Notebook, you can run a Jar file or you can run Python script directly.
So if I choose Notebook. [Video description begins] The host drags and drops Notebook in the workflow. Below, a series of tabs appears that read: General, Azure Databricks, Settings, and User properties. [Video description ends] I have some options here to load. Now if I go to Azure Databricks, there's a linked service that I need.
So I'm going to hit "New" because I need to link Data Factory to the Databricks instance. So if I click "New", [Video description begins] The host clicks the New button under the Azure Databricks tab. This opens an overlapping panel that reads: New linked service (Azure Databricks). It features a template. Below, there is the Create button. [Video description ends] I need to give it a few things under Azure subscription. I give it my subscription and I have to choose the workspace and it's my databricks266. Now, Select Cluster, I'm going to go New job cluster.
Now what this does is bring up a cluster, which takes a few minutes; it'll run and then shut the cluster down. I'm actually not going to run it today because I'm not going to get into running the actual Databricks notebook, I'm just showing you how to set it up so you can. So you could go New job cluster and run it once, or you might have an existing interactive cluster you could use, or an instance pool. Now it needs authentication, so I need an Access Token. Now I can get that from Databricks.
So if I go back to my Databricks screen under my user, under User Settings, I can hit "Generate New Token" and hit "Generate" and it creates a new token for me. [Video description begins] The host clicks the Generate New Token button which opens a dialog window with an empty field and the button Generate. [Video description ends] So I can just copy that and then I will go back to my data factory [Video description begins] The host copies the token that appeared in the empty field of the dialog window. [Video description ends] and paste that Access token in. Now you need to choose some things like Cluster version, Cluster node type, Python Version. Now, it depends on what you're running,
I'm not going to explain all of these because that's not in the scope of this, but I'm just going to choose some just to show you how it works. So I'll choose the very first one, 5.5 LTS. I'll choose a Standard. And Python version, I'll just leave as is; you would set whatever Python version you want to run. For Worker options, Fixed or Autoscaling, I'll just say 1 Worker and hit "Create".
So this creates a link to my Azure Databricks. Now, if I go under Settings on my Pipeline, I can now browse my notebooks. So if I click "Browse", I can double click on "Users". And now that I'm linked to it, I can go into my user and I can choose my demo_notebook and hit "OK". And you can set parameters and there are some other settings, but those are the main ones: you choose a notebook. And if you wanted to run it, you could go Add trigger and trigger it now.
Okay. Now, there are other options here. If I delete the Notebook activity and drag in a Python activity, it's the same idea. I choose the linked service. [Video description begins] The host chooses AzureDatabricks1 in the Databricks linked service dropdown menu. [Video description ends] And then under Settings, I can choose a Python file, and it's the same thing with the Jar activity, so you can run notebooks, Jar files, or Python files. So in conclusion, what I've shown you here is how to use Data Factory pipelines to run Databricks, whether that's running a Notebook, a Jar file, or a Python script.
Loading Data into Azure Synapse Analytics
In this video, I'm going to demonstrate how to copy an entire blob container containing blobs with a common schema into a table in a Synapse Analytics dedicated SQL pool. So here in Azure portal, I have a resource group set up and I want to step you through it. It's called DemoPolybase. And I have some resources set up here; there's one called demostorage266.
This is my storage account that has my blobs in it. It's the source. The destination is the synapse workspace, and specifically there is a Dedicated SQL pool set up called demo and that has a table which will be our destination. I'll show you in a minute. And we have a Data Factory. So if I open up my demostorage storage account, let me show you the blobs I have set up for our source. If I go to Containers, there is a container called source. I open that up. I have four JSON files and let me just show you their contents.
So if I click "Edit" on the first one, you see it has two fields, a "Name" and an "Age". "Name" is "John Doe" and "Age" is 25 in this case. If I go to the second one, it has the same schema, the same properties: "Name" is "Sarah Perkins", "Age" is 54. So just different values. If I go to user3, this is "Bob Dylan", 75. And if I go to user4, this is "Alice Jones", "Age": 18.
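As an aside, here is a hedged sketch of how blobs like these could be created from code with the azure-storage-blob package. The connection string is a placeholder, and the file names simply mirror the user1.json through user4.json layout shown above.

```python
# Hedged sketch: upload four small JSON documents to the "source" container.
# The connection string is a placeholder taken from the storage account's Access keys.
import json
from azure.storage.blob import BlobServiceClient

conn_str = "<demostorage266 connection string>"
users = [
    {"Name": "John Doe", "Age": 25},
    {"Name": "Sarah Perkins", "Age": 54},
    {"Name": "Bob Dylan", "Age": 75},
    {"Name": "Alice Jones", "Age": 18},
]

service = BlobServiceClient.from_connection_string(conn_str)
container = service.get_container_client("source")

# One JSON blob per user, matching the user1.json ... user4.json naming.
for i, user in enumerate(users, start=1):
    container.upload_blob(name=f"user{i}.json", data=json.dumps(user), overwrite=True)
```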
So these are the four JSON files, and what I'm going to do is bulk copy them into records in a table in Synapse Analytics. So if I go back to my resource group, let me show you the destination table. If I click on synapse266, my workspace, and go to Open Synapse Studio, then on the left, I'll just click "Data".
And I'll open up Databases, and I have my database demo; in demo, under Tables, I have a Users table. Before we start, I'll show you that if I select the TOP 100 rows, you will see there are no results, so the Users table is empty. What it does have are columns that match what's in my JSON: a name and an age. Okay. So going back to Azure and back to my resource group, I'm now going to open up Data Factory, and we're going to set up a pipeline to copy that data.
So in Data Factory, on Overview, I will click "Author & Monitor", and on the main screen I'm going to click "Copy data" to use the Copy Data tool. [Video description begins] The steps of the tool read: Properties, Source (which is divided into two parts namely Connection and Dataset), Destination (which is also divided into two parts namely Connection and Dataset), Settings, Summary, and Deployment. [Video description ends]
Now under Properties, I'm just going to use the default task name, and we're only going to run it once, so Run once now. Click "Next". Source data store: my source is Blob Storage, so I click "Create new connection", choose Blob Storage, and click "Continue". [Video description begins] The host clicks on Create new connection. This button opens an overlapping panel on the right side of the screen that reads: New linked service. There are various services to choose from, and below, there is the button Continue. [Video description ends] Now, we need to give it some information.
First, I need to select my subscription, and then I have to give it my Storage account name, which is demostorage266. At the bottom, I'll click "Test connection", and the connection is successful, so I'll click "Create". And now I have a source data store. If I click "Next", it says choose the input file or folder, so I'll click "Browse". I'm not going to choose a single JSON file; instead, I'm going to choose the whole source folder. And I will click "Next".
Now for File format, I have to give it the correct format, so I'll choose JSON format. Once I do that, you'll see at the bottom I get a Preview showing one of the JSON files, "John Doe", and if I click "Schema", it has picked up the schema correctly, noting the columns name and age, a string and an integer. So I'll click "Next". Now I need a destination data store. This is going to be Synapse Analytics, so I'll click "Create new connection", type in synapse, choose Azure Synapse Analytics, and hit "Continue".
Now we need to fill in the details, so I'll fill in my subscription, choose my Server name, synapse266, and my Database name, demo. Then I have to enter my User name and Password. I'll use the user that was set up when I first created my Synapse Analytics instance, which is sqladminuser, and the password I gave it for testing. So I hit "Test connection": Connection successful, so I hit "Create". Okay. So now that's my destination data store, and I'll click "Next".
Now we do Table mapping, so we have to map the Blob Storage data to the table in Synapse Analytics. I'm going to click "Use existing table", choose the Users table, and click "Next". Now, I get a screen that shows Name and Age; these come from the Blob Storage source. And if I scroll to the right, it asks me to map them to column names in my destination.
So I'm going to map name to name and age to age, and I'll click "Next". Now, all that's left to do is choose a Copy method. I'm not going to use "Enable staging"; I'm simply going to choose "Bulk insert", which is the simplest option. It's not very efficient, but it's easy to do. I'll click "Next". So you see on my Summary that I'm going from Azure Blob Storage and doing a bulk copy to Azure Synapse Analytics.
And I'll click "Next", and it validates, creates the datasets, creates the pipeline, and runs the pipeline, and it's already finished. So I'll click "Finish". That's all there is to it. So let's take a look at what happened. If I go to Synapse Analytics and go to "Data", which I'm already at, my demo (SQL) database is open, and under Tables I have Users. If I check the TOP 100 rows for Users, you now see I have four records, which is what I would expect: one for each JSON object.
Bob Dylan is 75, Alice Jones 18, John Doe 25, and Sarah Perkins 54. So in conclusion, I've shown how to bulk copy the contents of multiple blob objects from a blob container into a Synapse Analytics dedicated SQL Pool, so the data is available for further analysis.
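As a quick aside before moving on: if you'd rather verify the load from code instead of Synapse Studio, a sketch along these lines would work, assuming pyodbc and a SQL Server ODBC driver are installed and that the server, database, table schema, and credentials match the ones used in the demo (they are placeholders here).

```python
# Hedged verification sketch: query the Users table in the dedicated SQL pool.
# Server, database, user, password, and the dbo schema are placeholders matching the demo setup.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=synapse266.sql.azuresynapse.net;"
    "DATABASE=demo;"
    "UID=sqladminuser;"
    "PWD=<password>"
)

cursor = conn.cursor()
cursor.execute("SELECT TOP 100 name, age FROM dbo.Users")
for name, age in cursor.fetchall():
    print(name, age)  # expect the four records copied from the JSON blobs

conn.close()
```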
Loading Data Using Azure Data Lake
In this video, I'm going to demonstrate how to use Azure Data Factory to copy data from Data Lake to Synapse Analytics. Data Lake is a common big data store for datasets that are destined for offline analytics. So this combination of Data Lake as a data source and Synapse Analytics as a consumer makes a lot of sense.
So here in Azure, I have a resource group already set up with some resources in it for this demo. If I go to the DataLakeDemo resource group, I have a Synapse Analytics instance, which is our destination; I have a Data Lake instance, which is our source; and I have the Data Factory instance to copy data between the two. Now, let me show you my source first, the data lake. I'll open that up and click "Data explorer". I have a folder called source, and if I open it up, I have four JSON objects that I've uploaded. Let me just preview one of them: I'll right-click and hit "Preview", and you see that it's a JSON object with a "Name" and an "Age", so "John Doe", "Age": 25.
And they all look like that: if I look at user3 and hit "Preview", it's "Bob Dylan", "Age": 75. They all have the same schema, just different data. If I go back to my resource group and go to the Synapse Analytics instance, I'm going to click on "Open Synapse Studio", and on the left, if I go to "Data", you'll see that I have one database set up, which I call synapsedb. If I open it up, under Tables there's a table called Users. This is designed to be the destination for my source data; it has the columns name and age.
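For reference, the source files we just previewed can also be listed and read from code with the azure-datalake-store package, the SDK for Data Lake Storage Gen1. This is a hedged sketch: the tenant ID, client ID, and client secret are placeholders for a service principal that has already been granted access to the store.

```python
# Hedged sketch: list and preview the source JSON files in the Gen1 store from Python.
# Tenant ID, client ID, and client secret are placeholders for a service principal
# that has access to datalake266.
from azure.datalake.store import core, lib

token = lib.auth(
    tenant_id="<tenant-id>",
    client_id="<client-id>",
    client_secret="<client-secret>",
)
adl = core.AzureDLFileSystem(token, store_name="datalake266")

# List the files in the source folder, then read one of them.
print(adl.ls("/source"))
with adl.open("/source/user1.json", "rb") as f:
    print(f.read().decode("utf-8"))  # expect {"Name": "John Doe", "Age": 25}
```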
So if I go back, I'm now going to open the Data Factory and show you how to map this. I'll click "Author & Monitor", and I'm going to use the Copy Data tool, so I'll click "Copy data". I'll keep the default Task name, and I'm going to run it only once. Click "Next". Now I need to choose a source data store, so I'll click "Create new connection". My source data store is Data Lake, so I'll choose Azure Data Lake Storage Gen1 and click "Continue". Now, under subscription, I have to choose my subscription.
And I have to choose my Data Lake Store account name, datalake266. Now this is interesting: it uses a Managed identity called datafactory266, the same name as the Data Factory instance I'm working in right now. That managed identity needs access. It says Grant Data Factory service managed identity access to your Azure Data Lake Storage Gen1.
Now, I haven't done that yet. So if I hit "Test connection", the connection fails, and if I click on "More", I see that the reason is Permission denied. The Data Factory instance does not have permission to talk to Data Lake. So I'll go back to Azure, and what I'll do is give it that permission: going to my Data Lake instance, I'm going to click "Data explorer". [Video description begins] The host clicks the Data explorer button in the Overview blade of Datalake266 storage. The next screen features a series of options up at the top that read: Filter, New folder, Upload, Access, Rename folder, Folder properties, Delete folder, Refresh, and Open in Explorer. [Video description ends]
On my source folder, at the top, I'm going to click "Access" and then "Add" [Video description begins] The host clicks Access. The next screen features a series of buttons up at the top that read: Add, Save, Discard, Advanced. [Video description ends] to add the access that is required. First, I have to select a user or group, so I'll click "Select". [Video description begins] An overlapping panel opens to the right that reads: Select user or group. There is a search bar. [Video description ends]
The managed identity for Data Factory has the same name as the instance itself, so it's the one called datafactory266. I'll choose that and hit "Select". And for Select permissions, [Video description begins] The host types datafactory266 in the search bar, and then clicks the Select button below. [Video description ends] I'll give it full permissions: Read, Write, and Execute. And I'm going to add the permissions not just to this folder, because if I did that, it would have access to the folder but not to the JSON files inside.
So I'm going to say "This folder and all children" and click "OK". That is added and assigns permission to datafactory266. So now if I go back to my copy data tool and click "Test connection" again, it's successful, so I'll click "Create". Okay. My source data store is set up, so I click "Next". Now I have to choose a File or folder, so I'll click "Browse" and choose my source folder.
And I'll click "Next". Now under File format settings, I need to choose JSON format. Okay. So if I look at the bottom here, I see the Preview "Name": "John Doe" "Age": 25. And it knows the schema. So I'll click "Next". Now it needs a destination data store, and that's going to be Synapse Analytics. So I choose Azure Synapse Analytics, click "Continue". And I need to fill in the details.
So the Server name is synapse266, the Database name is synapsedb, and I need to give it the User name and Password of my admin user. That was sqladminuser, the user that was created when I created my Synapse database, and I'll give it my password and click "Create". That creates my Synapse Analytics destination, so I click "Next". Now I need to map the table, so I'll click "Use existing table", choose the Users table in Synapse Analytics, and click "Next". And now I have to map the fields. The fields in the source JSON are Name and Age.
If I scroll to the right, I can choose the column names in my destination table. Click "Next". What I'm going to do is just a bulk insert: I'm going to disable "Enable staging", and for Copy method I'm going to choose "Bulk insert" and click "Next". So it's going to bulk insert from Data Lake into Azure Synapse Analytics. I'll click "Next". It does the validation and runs the pipeline. Succeeded. So I'll click "Finish".
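Once the tool has deployed the pipeline, you can also rerun and monitor it from code. Here is a hedged sketch using the azure-identity and azure-mgmt-datafactory packages; the subscription ID and the pipeline name generated by the Copy Data tool are placeholders, while the resource group and factory names match this demo.

```python
# Hedged sketch: rerun and monitor the deployed pipeline with the Azure SDK for Python.
# The subscription ID and the generated pipeline name are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

# Kick off a new run of the pipeline inside datafactory266 in the DataLakeDemo resource group.
run_response = adf_client.pipelines.create_run(
    resource_group_name="DataLakeDemo",
    factory_name="datafactory266",
    pipeline_name="<generated-pipeline-name>",
)

# Check the run's status afterwards.
run = adf_client.pipeline_runs.get("DataLakeDemo", "datafactory266", run_response.run_id)
print(run.status)  # e.g. InProgress or Succeeded
```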
So now if I go back to Azure, go to my resource group, and then go to my destination, which is Synapse Analytics, I'll open Synapse Studio and, on the left, click "Data". I'll open my database synapsedb and open Tables, and on Users, I'll run a SQL script to look at the TOP 100 rows. If you take a look, you can see the four rows, which represent the four JSON files. So in conclusion, I've shown you how to use Azure Data Factory to copy data from Data Lake into Synapse Analytics for analysis.
Course Summary
So in this course, we've examined data ingestion and processing on Azure.
We did this by exploring data loading strategies for Synapse SQL Pools, Azure Data Factory pipelines and activities, creating an Azure Data Factory and using the Copy Data tool, using PolyBase, SQL Server Integration Services, and Databricks to ingest data, and loading data into Azure Synapse Analytics and from Azure Data Lake.
In our next course, we'll move on to examine Microsoft Power BI including data optimization, datasets, dashboards, reporting, visualizations, and workflows.