In this long read we take a look at Azure Data Factory (ADF). As a data integration tool it can be used for cloud data warehousing as well as data analytics and data lake storage. This long read is sampled from Dmitry Anoshin’s and Dmitry Foshin’s book Azure Data Factory Cookbook.
Microsoft’s Azure platform offers modern organizations a range of different cloud-based services, organized into several key components positioned around compute, storage, database and network requirements. These components serve as building blocks for any organization wishing to gain the benefits of cloud computing, which include utilities metrics, elasticity, security, and more. Many organizations across the world already benefit from cloud deployment and have fully moved to the Azure cloud. They deploy business applications and run their business in the cloud. As a result, their data is stored in cloud storage and cloud applications.
Microsoft Azure also offers a cloud analytics stack that lets us create modern analytics solutions, extract data from on-premises and cloud applications, and use that data for decision making by searching for patterns in it and deploying machine learning applications.
In this brief introduction to Azure Data Platform services we focus in particular on the main cloud data integration service – Azure Data Factory (ADF).
Meeting the Azure Data Platform
The services making up the Azure Data Platform (ADP) are summarized in the table below:
|Service|Description|
|---|---|
|Azure Synapse Analytics|Limitless analytics service with unmatched time to insight (formerly SQL Data Warehouse)|
|Power BI|Business Intelligence solution for building reports, dashboards and data visualization|
|Azure Data Lake Analytics|Distributed analytics service that makes big data easy|
|Azure Stream Analytics|Real-time analytics on fast moving streams of data from applications and devices|
|Azure Data Factory|Hybrid data integration at enterprise scale, made easy|
|Azure Databricks|Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform|
|Azure Cognitive Services|Cloud-based services with REST APIs and client library SDKs that help developers build cognitive intelligence into applications without direct artificial intelligence (AI) or data science skills or knowledge|
|Azure Event Hubs|Big data streaming platform and event ingestion service|
|Azure Data Lake Storage|Set of capabilities dedicated to big data analytics, built on Azure Blob storage|
|Azure HDInsight|Provision cloud Hadoop, Spark, R Server, HBase, and Storm clusters|
|Azure Cosmos DB|Fast NoSQL database with open APIs for any scale|
|Azure SQL Database|Managed, intelligent SQL in the cloud|
Using ADP services it is possible to build a modern analytics solution that is secure and scalable. The following image shows an example of a typical modern cloud analytics architecture.
Modern Analytics Solution Architecture
Most of the Azure Data Platform services can be found in this diagram. ADF is a core service for data movement and transformation.
Looking more closely at the reference architecture in Figure 1, we start with the source systems, where we can collect data from files, databases, APIs, IoT devices and so on. We can then use Event Hubs for streaming data and ADF for batch operations. ADF pushes data into Azure Data Lake as a staging area, and we can then prepare the data for analytics and reporting in Azure Synapse Analytics. Moreover, we can use Databricks for big data processing and ML models. Power BI provides the data visualization layer. Finally, we can push data into Azure Cosmos DB if we want to use it in business applications.
Rolling up our sleeves…
There’s nothing like interacting with a platform to get a feel for how it works and what it can offer. In that spirit we are now going to create a free Azure account, log in to the Azure Portal and locate the Azure Data Factory service. (If you already have an Azure account, you can skip the account creation and log in to the portal directly.)
First, go to https://azure.microsoft.com/free/
1: Click Start for free
2: You can sign in to your existing Microsoft account or create a new one. Let’s create one as an example
3: Enter your email address and click Next
4: Enter a password of your choice
5: Verify email by entering code and click Next
6: Fill in the information about your profile (Country, Name and so on). You will also be asked for your credit card information.
7: After you have finished the account creation, it will bring you to the Microsoft Azure Portal
8: Now we can explore the Azure Portal and find the Azure data services. Let’s find Azure Synapse Analytics. In the search bar, enter Azure Synapse Analytics, choose Azure Synapse Analytics (formerly SQL DW) and press Enter. It will open the Synapse control panel.
Here we can launch a new instance of a Synapse data warehouse.
9: Let’s find and open Data Factories. In the next recipe we will create a new data factory using the same UI.
Before doing anything with ADF, let’s just review what the Azure account offers. With a free Azure account, we can benefit from:
- 12 months of free access to popular products
- $250 credits
- 25+ always free products
You won’t be charged unless you choose to upgrade.
Azure Portal is a friendly UI where we can easily locate, launch, pause or terminate a service. Besides the UI, Azure offers other ways of communicating with its services, such as the Command-Line Interface (CLI), Application Programming Interfaces (APIs), Software Development Kits (SDKs) and so on.
Using the Microsoft Azure Portal, you can choose the Analytics category and it will show you all the analytics services, as shown in the following image:
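To give a feel for the API route, here is a minimal Python sketch that only assembles the Azure Resource Manager URL identifying a data factory; it sends no request. The helper name `data_factory_url` is hypothetical, the subscription ID and factory name are placeholders, and `2018-06-01` is the Data Factory REST API version assumed here.

```python
from urllib.parse import quote

ARM_ENDPOINT = "https://management.azure.com"
API_VERSION = "2018-06-01"  # assumed Data Factory REST API version


def data_factory_url(subscription_id: str, resource_group: str, factory_name: str) -> str:
    """Build the Azure Resource Manager URL that identifies one data factory."""
    path = "/".join([
        "subscriptions", quote(subscription_id, safe=""),
        "resourceGroups", quote(resource_group, safe=""),
        "providers", "Microsoft.DataFactory",
        "factories", quote(factory_name, safe=""),
    ])
    return f"{ARM_ENDPOINT}/{path}?api-version={API_VERSION}"


# Placeholder values, not real resources.
url = data_factory_url(
    "00000000-0000-0000-0000-000000000000",
    "ADFCookbook",
    "my-data-factory",
)
print(url)
```

A real call against this URL would also need an authorization token in the request headers, which is one of the details the CLI and the SDKs take care of for you.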
Azure Analytics Services
We just located ADF in Azure Portal. Next, we should be able to create an ADF job.
Creating and executing our first job in ADF
Azure Data Factory (ADF) allows us to create workflows for transforming data and orchestrating data movement. You may think of ADF as an ETL (short for Extract, Transform, Load)/ELT tool for the Azure cloud and Azure Data Platform. ADF exemplifies the Software as a Service (SaaS) model. This means that we don’t need to deploy any hardware or software, and we only pay for what we use. ADF is often referred to as code-free ETL as a service. The key operations covered by ADF are:
- Ingest – allows us to collect data and load it into Azure Data Platform storage. ADF has 90+ data connectors for this purpose.
- Control Flow – allows us to design code-free ETL.
- Data Flow – allows us to design code-free data transformations.
- Schedule – allows us to schedule ETL jobs.
- Monitor – allows us to monitor ETL jobs.
Now let’s try them out! We will continue from where we left off, when we found ADF in the Azure Portal. We will create a data factory using the most straightforward method: the ADF UI, reached through the Azure Portal. It is important to have sufficient permissions to create a new data factory. In our example we have super admin rights, so we should be good to go.
During the exercise we’ll create a new resource group. A resource group is a collection of resources that share the same lifecycle, permissions, and policies.
Let’s get back to our Data Factory.
1: If you have closed the Data Factory console, open it again. Search for ‘Data factories’ and press Enter.
2: Click Create data factory and it will open the project details where we will choose Subscription (in our case Free Trial).
3: We haven’t created a resource group yet. Click Create new and type the name ADFCookbook. Choose the Region as East US, give the name as ADFcookbookJob1- and leave version V2. Then click Next: Git Configuration.
4: We can use GitHub or Azure DevOps. We won’t configure anything now, so we will tick Configure Git later. Then click Next: Networking.
5: We have the option to increase the security of our pipelines by using a Managed Virtual Network and a Private endpoint. For this exercise, we will use the default settings. Click Next.
6: Optionally, you can specify tags. Then click Next: Review + Create. ADF will validate our settings and will allow us to click Create.
7: Azure will deploy the Data Factory. We can choose our Data Factory and click Author and Monitor. It will open the Data Factory UI home page, where we can find lots of useful tutorials and webinars.
8: From the left panel choose the blue pencil, as shown in the following image, and it will open the window where we will start creating the pipeline. Choose New pipeline and it will open the pipeline1 window, where we have to provide the following information: input, output and compute. Add the name ADF-cookbook-pipeline1 and click Validate All.
Azure Data Factory Resources
9: While executing Step 8 you will find that you can’t save the pipeline without an activity. For our new data pipeline, we will use a simple copy data activity: we will copy a file from one blob folder to another. Here we won’t spend time on spinning up resources like databases, Synapse or Databricks. In order to copy data from blob storage, we should create an Azure Storage account and a blob container.
10: Let’s create the Azure Storage account. Go to All Services > Storage > Storage Accounts.
11: Click +Add
12: Use our Free Trial subscription. For resource group, we will use ADFCookbook. Give a name for the storage account such as adfcookbookstorage then click Review and Create.
13: Click Go to Resource and select Containers
Azure Storage Account UI
14: Click +Container and enter the name adfcookbook.
15: Now, suppose we want to upload a data file called SalesOrders.txt. Go to the container adfcookbook and click Upload. We will specify the folder name input. We just uploaded a file to the cloud! You can find it at the path /container/folder/file, that is, adfcookbook/input/SalesOrders.txt.
16: Next, we can go back to ADF. In order to finish the pipeline, we should add an input Dataset and create a new Linked Service.
17: In ADF Studio, click the Manage icon in the left sidebar. It will open Linked Services. Click +New, choose Azure Blob Storage, and click Continue.
18: We can optionally change the name or leave it by default, but you have to specify the subscription and choose the storage account that we just created.
19: Click Test Connection and if all is good click Create.
20: Next, we will add a Dataset. Go to our pipeline and click New dataset as shown in the following image:
Azure Data Factory Resources
21: Choose Azure Blob Storage and click Continue. Choose Binary format type for our text file and click Continue.
22: Now we can select the AzureBlobStorage1 linked service, specify the path to the file adfcookbook/input/SalesOrders.txt, and click Create.
23: We can give the name of the dataset in Properties. Type in SalesOrdersDataset and click Validate all. We shouldn’t encounter any issues with data.
24: We should add one more dataset as an output for our job. Let’s create a new dataset with the name SalesOrdersDatasetOutput.
25: Now, we can go back to our data pipeline. We couldn’t save it earlier because it had no activity. Now we have everything we need to finish the pipeline. Add the new pipeline and add the name ADF-cookbook-pipeline1. Then, from the activity list, expand Move & transform and drag and drop the Copy data step onto the canvas.
26: We must specify the parameters of the step – Source and Sink information. Click the Source tab and choose our dataset SalesOrdersDataset.
27: Click the Sink tab and choose SalesOrdersDatasetOutput. This will be our output folder.
28: Now we can publish two datasets and one pipeline.
29: Then we can trigger our pipeline manually. Click Add trigger as shown in the following image:
Azure Data Factory canvas with Copy data Activity
30: Select Trigger Now. It will launch our job.
31: We can click Monitor in the left sidebar and find the pipeline runs. In case of failure, we can pick up the logs here and find the root cause. In our case the pipeline ADF-cookbook-pipeline1 succeeds. To see the outcome, we should go to Azure Storage and open our container. There you will find an additional folder, output, containing the file SalesOrders.txt.
Congratulations, you have just created your first ADF job using the UI!
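The walkthrough leans on the container/folder/file path convention, both when uploading the file and when pointing the dataset at it. As a small illustration, the Python sketch below splits such a path into its parts and derives the path the copied file ends up at. The `BlobPath` type and `split_blob_path` helper are hypothetical names introduced only for this example; they are not part of any Azure library.

```python
from typing import NamedTuple


class BlobPath(NamedTuple):
    container: str
    folder: str
    file_name: str


def split_blob_path(path: str) -> BlobPath:
    """Split 'container/folder/file' into its three parts."""
    parts = [p for p in path.strip("/").split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected at least container/file, got {path!r}")
    # First part is the container, last is the file, anything between is the folder.
    container, *folders, file_name = parts
    return BlobPath(container, "/".join(folders), file_name)


source = split_blob_path("adfcookbook/input/SalesOrders.txt")
target = source._replace(folder="output")  # same container and file, new folder

print(source)
print("/".join(target))
```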
How it works
Using the Data Factory UI we created a new pipeline, that is, an ETL job. We specified input and output datasets and used Azure Blob Storage as a Linked Service (LS). A Linked Service is a kind of connection string: ADF uses it to connect to external resources. Datasets, on the other hand, represent the structure of the data held in those data stores. Our job performed the simple activity of copying data from one folder to another. After the job ran, we reviewed the run logs in the Monitor section.
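How the three kinds of objects refer to each other can be sketched with plain Python dictionaries modeled on ADF's JSON definitions. This is a simplified illustration rather than the exact output of the UI: the definitions ADF generates carry more properties, and the connection string placeholder, the activity name `Copy data1` and the helper `binary_dataset` are assumptions made for the example.

```python
import json

# Linked service: how to reach the storage account (a kind of connection string).
linked_service = {
    "name": "AzureBlobStorage1",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            # Placeholder; a real definition holds the account's connection details.
            "connectionString": "<storage-account-connection-string>",
        },
    },
}


def binary_dataset(name: str, folder: str, file_name: str = "") -> dict:
    """Describe a binary file (or a whole folder) in the adfcookbook container."""
    location = {
        "type": "AzureBlobStorageLocation",
        "container": "adfcookbook",
        "folderPath": folder,
    }
    if file_name:
        location["fileName"] = file_name
    return {
        "name": name,
        "properties": {
            "type": "Binary",
            # The dataset points at the linked service by name.
            "linkedServiceName": {
                "referenceName": linked_service["name"],
                "type": "LinkedServiceReference",
            },
            "typeProperties": {"location": location},
        },
    }


source_ds = binary_dataset("SalesOrdersDataset", "input", "SalesOrders.txt")
sink_ds = binary_dataset("SalesOrdersDatasetOutput", "output")

# The pipeline holds one Copy activity that points at the datasets by name.
pipeline = {
    "name": "ADF-cookbook-pipeline1",
    "properties": {
        "activities": [
            {
                "name": "Copy data1",  # assumed activity name
                "type": "Copy",
                "inputs": [{"referenceName": source_ds["name"], "type": "DatasetReference"}],
                "outputs": [{"referenceName": sink_ds["name"], "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "BinarySource"},
                    "sink": {"type": "BinarySink"},
                },
            }
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```

The point to notice is the chain of references: the pipeline names its datasets, and each dataset names the linked service, so the connection details live in exactly one place.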
A Data Factory pipeline is defined by a set of JSON config files. We use the UI to create these configuration files and run the job. You can review the JSON config files by downloading them, as shown in the following figure:
Downloading pipeline config files
It will save the archive file. Extract it and you’ll find a folder with the following subfolders:
- Linked Service
Each folder has a corresponding JSON config file.
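Once the archive is extracted, the config files can be inspected with a few lines of Python. The sketch below walks a folder, parses every JSON file one level down and groups the results by subfolder. The function names are hypothetical, and the demo builds a temporary folder with made-up subfolder names and minimal file contents standing in for a real extracted archive.

```python
import json
import tempfile
from pathlib import Path


def load_configs(root: Path) -> dict:
    """Map each subfolder name to {config name: parsed JSON} for the files in it."""
    configs: dict = {}
    for json_file in sorted(root.glob("*/*.json")):
        folder = json_file.parent.name
        configs.setdefault(folder, {})[json_file.stem] = json.loads(
            json_file.read_text(encoding="utf-8")
        )
    return configs


def summarize(configs: dict) -> list:
    """One line per subfolder, listing the config files found in it."""
    return [
        f"{folder}: {', '.join(sorted(files))}"
        for folder, files in sorted(configs.items())
    ]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical folder and file names standing in for an extracted archive.
    samples = [
        ("pipeline", "ADF-cookbook-pipeline1"),
        ("dataset", "SalesOrdersDataset"),
    ]
    for folder, name in samples:
        (root / folder).mkdir()
        (root / folder / f"{name}.json").write_text(
            json.dumps({"name": name}), encoding="utf-8"
        )
    configs = load_configs(root)

for line in summarize(configs):
    print(line)
```

To inspect a real download, call `load_configs` with the path of the folder you extracted.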
Azure Data Platform’s various services allow you to build the kinds of high-performance, cloud-based analytics solutions that modern organizations and businesses increasingly depend on. The starting point is to get your data onto the platform, and that’s where Azure Data Factory (ADF) comes in. Here we’ve just hinted at the functionality on offer. A fully rounded service for integrating data from diverse sources and bringing it into the ADP cloud, ADF allows you to upload and integrate your data with the minimum of fuss. Then the analytics fun can really begin.