This blog post is about what Azure Data Factory is and how it relates to SSIS. Today, SSIS is the most common tool (in the Microsoft sphere) to extract, transform and load data into data warehouses. Many of us have used SSIS for several years now, and it's a mature and stable tool. So why this new Microsoft tool called Azure Data Factory? Is it just a cloud version of SSIS?
Let's see what Microsoft says about Azure Data Factory:
- Compose and manage data services at scale
- Create, schedule, and manage data pipelines
- Visualize data lineage
- Connect to on-premises and cloud data sources
- Monitor data pipeline health
- Automate cloud resource management
Seems like a lot of actions around connecting to, reading, moving, monitoring and scheduling data, and not so much about transforming it. We have to find out why Microsoft wants us to use this new tool in the Big Data area. I found the following pictures in Reza Rad's PowerPoint presentation Azure Data Factory vs. SSIS, and they explain it well.
This is how we are used to working: first we extract data from sources, mainly tables and text files, and then we transform the data to fit into a structured model such as a dimensional model or a Data Vault model.
For reasonably large data volumes, SSIS and this kind of infrastructure work fine. But when we look at sources such as social media, which deliver big volumes of streamed, unstructured data (so-called Big Data), SSIS no longer fits as well as before, and that's why there is a need for Azure Data Factory.
We have to say something about Big Data before looking into Azure Data Factory. When is data called Big Data? If one or more of the following three V's are fulfilled, then it is Big Data:
- Large data volumes (Volume)
- Unstructured data (Variety)
- Delivered at any time with any periodicity, or streamed (Velocity)
What we do when it's Big Data is simply put the data into the data lake, without structuring it, and let the analysts take care of the structuring problem. Using SSIS for that is not optimal, which is why Microsoft has this new tool, Azure Data Factory.
Azure Data Factory
Azure Data Factory lets you create data-driven flows that move data between supported data stores and process it using compute services in other regions or in an on-premises environment. It also allows you to monitor and manage workflows.
Azure Data Factory uses the key entities below to define input and output data, processing events, and the schedule and resources required to execute the desired data flow.
A Pipeline is a group of Activities. Together, the Activities in a pipeline perform a task, such as reading an Azure blob and then running a Hive query on an HDInsight cluster to partition the log data. Deployment and scheduling are done at the pipeline level.
An Activity is either of the type Data Movement or Data Transformation. Data Movement is what it sounds like, while Data Transformation is a call to C#, Pig, Hive, Azure ML batch scoring or a stored procedure; there are no built-in transformations like in SSIS, so everything must be coded by hand.
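To give an idea of what this looks like, here is a minimal sketch of a pipeline definition with a single Copy activity, in the style of Data Factory's (v1) JSON; the pipeline name, dataset names and dates are hypothetical placeholders:

```json
{
  "name": "CopyLogDataPipeline",
  "properties": {
    "description": "Copies data from a source dataset to a target dataset",
    "activities": [
      {
        "name": "CopySourceToTarget",
        "type": "Copy",
        "inputs": [ { "name": "SourceDataset" } ],
        "outputs": [ { "name": "TargetDataset" } ],
        "typeProperties": {
          "source": { "type": "SqlSource" },
          "sink": { "type": "SqlSink" }
        }
      }
    ],
    "start": "2017-01-01T00:00:00Z",
    "end": "2017-01-02T00:00:00Z"
  }
}
```

The activity only references datasets by name; the datasets themselves, and the linked services they use, are defined in separate JSON objects.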
Datasets represent data structures within the data stores; they simply point to or reference the data you want to use in your activities as inputs or outputs.
Linked services are much like connection strings: they define the connection information that Azure Data Factory needs to connect to external resources.
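As an illustration, a dataset and the linked service it references could be sketched like this (again v1-style JSON; all names, the table and the connection details are placeholders):

```json
{
  "name": "SourceDataset",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": "AzureSqlLinkedService",
    "typeProperties": { "tableName": "SalesOrders" },
    "availability": { "frequency": "Hour", "interval": 1 }
  }
}
```

```json
{
  "name": "AzureSqlLinkedService",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "Server=tcp:<server>.database.windows.net,1433;Database=<database>;User ID=<user>;Password=<password>;Encrypt=True"
    }
  }
}
```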
Development is mainly done in the Azure portal, and it is not nearly as powerful as SSIS. All objects are JSON files, but there are wizards that create the JSON scripts. To create a pipeline that copies data between two Azure SQL databases, we just have to point out the source and target server, database and table; the wizard then creates the JSON scripts needed for the pipeline, the two Datasets and the two Linked services. The JSON scripts can be inspected afterwards.
Monitor and Manage
Azure Data Factory is also (or mainly) a tool to manage load jobs, and we do that in the Monitor and Manage section. Pipelines can be scheduled to run repeatedly, and each execution can be monitored and re-executed. Pipelines can also be paused; each execution is then queued and will run when the pipeline is activated again.
SSIS and Azure Data Factory aren't built for the same purpose; they complement each other.
So, when to use Azure Data Factory?
- If you are dealing with Big Data
- If either the source or the target of the data is in the cloud
- If you are dealing with Azure Storage
- If Azure is on one side of the data movement