ETL pipeline is definitely an essential part of a data-rich customer-centric world.
Off-the-shelf ETL tools available on the market can simplify data processing and make it easier to handle for business users. So what do ETL data pipeline tools do, and what are their useful features?
In simple words, the aim of data pipelines is to pick all relevant data, apply the required transformations, and deliver datasets to required destinations. They provide organized, ready to use data which is expected by data consumers. For the purposes of this article, we’re focusing on one of the most popular categories of data pipeline architecture: ETL and ELT processes.
Here, we’ll take you through the ETL process, and explain how Matillion ETL and Alteryx, two tools for building an ETL pipeline, might benefit your business. Before moving forward, let’s quickly recap how to understand ETL and ELT acronyms and what the process looks like.
What are the three steps of the ETL data pipeline?
The three key steps of an ELT or ETL data pipeline automate the processing of your data, and are therefore essential to making your data accessible to those who need to use it to make crucial business decisions. They're built upon three steps: Extract, Transform, Load.
Step A: Extraction
The first step in the ETL data process is pulling your data from all the relevant data sources. They can include on-site and cloud-based operational databases, CRM and ERP systems, marketing automation platforms, unstructured files, APIs, webpages and any other resources generating data that users need.
Each of these sources produces data of different volumes, structures and formats, so the extraction step is adjusted to source type. Regardless of this fact, extracted raw data is saved in an organized way to the staging area to make sure it’s ready to use in Step B.
Step B: Transformation
In this step of the ETL data pipeline, all raw data you extracted in Step A is transformed into a format that can be read and understood consistently by the target system. Your required level of data manipulation will be unique to your business and your purpose, but more complex data sources that lack good structure will require more transformation than simple, quality datasets.
Data transformation can cover several operations, such as parsing, merging and joining, cleaning (e.g. removing duplicates / outstanding values, adding missing values, adjusting formats). Any transformations can be applied to satisfy your business rules defining a way to get clean data.
Step C: Loading data
Once transformations have been applied to get clean data, it needs to be loaded into the target system. The target system can be, for example, a data warehouse, analytical platform or any other destination which makes data ready to use for data consumers, including basic users, analytical, business intelligence, and data science teams as well as data-driven applications.
What is the difference between ELT and ETL data pipelines?
ETL refers to a pipeline that consists of the three key steps described above.
- Extract data from the source system(s) to the staging area
- Transform data to get desired information and data structure
- Load data to the destination system(s) (e.g. a cloud data warehouse)
ELT is another pipeline type, which Extracts the data from the source and Loads it to a dedicated space in the target system. Then, using the compute engine available in the target system, data Transformations are applied.
In the ELT process, there is no staging area outside the destination system: you simply load raw data to the destination, perform transformations there, and save the refined output to the area (e.g. database schema) that is available for data consumers.
Architecture for these pipeline types varies, but from the end-users (data consumer) perspective, the result is the same; in both cases, the final dataset is ready to be used. For this reason, many end-users don’t distinguish between them and simply call them ‘ETL data pipeline’.
Why does a business need ETL data pipelines?
As you can see, there is no magic transition that turns raw data into a dataset that is ready for users' needs. Data pipelines are essential to this journey, and someone is required to implement them. An ELT or ETL pipeline can be built as an end-to-end solution by a data engineering team, which can deliver full pipeline customization.
On the other hand, the required types of data sources and transformations for data preparation overlap across many projects. Most frequent tasks are repetitive, so tools to automate data processing can, and should, be implemented to speed up the process and free up data specialists. As the range and scale of data grow, more and more tools to build and maintain ETL data pipelines are necessarily developed.
Data consumers must consolidate all required data sources efficiently in order to harness the data they need for crucial business decisions (for businesses across different sectors, including retail). No-code or low-code ETL platforms can be useful for data engineers to build pipelines faster with reusable components, and they allow less-skilled business users to successfully build pipelines on their own, which shortens the time to delivery of enriched data.
ETL data pipeline tools allow you to focus on pipeline and data transformation setup from a business perspective rather than programming everything from scratch. For example, you don’t have to implement code to extract the data from a source or merge datasets on your own; instead, built-in components simplify these tasks.
ETL data pipeline tools: Matillion ETL vs. Alteryx
The last decade brought several products dedicated to building ELT/ETL pipelines that are now competing for significant market share. The scope of Software-as-a-Service (SaaS) platforms that help companies manage advertising, investing, sales, invoicing, billing, and user analytics is growing at a rapid pace.
In addition, more and more resources have moved to the cloud in order to reduce costs and required maintenance. As a result, there are now plenty of data sources that potentially can be used in building an ETL data pipeline, and, subsequently, useful business analytics.
Modern ETL pipeline tools are no longer limited to well-known on-site integrations (i.e. components to automate data extraction) but cover cloud-based SaaS and Platform-as-a-Service (PaaS) as well.
We’re taking a look at two products available on the market and the possibilities they bring: Matillion and Alteryx.
Matillion ETL is a self-hosted solution, created in 2011. It was built to provide data to cloud-based data platforms with a web application in which you can build, set up, test, and schedule pipelines. The name of this product might be slightly misleading because it actually provides an ELT data pipeline approach, which will be explained later.
The tool can extract data from around 100 data sources. To check if the data source you want to use is supported, you can refer to the documentation (navigate to Orchestration, then Connectors). If you want to extract data from a system not included in the list, Wizard-Driven Universal Connector framework is available to allow you to add a custom data source.
In terms of supported data destinations, Matillion ETL can ingest the data to Snowflake, Amazon Redshift, Google BigQuery, Microsoft Azure Synapse, and Delta Lake on Databricks natively. This solution does not use its own infrastructure but requires it to be installed on the company's cloud resources. The scope of supported destinations is not wide, as the tool focuses on ETL processes for data warehouses and analytical platforms.
To create an ETL data pipeline, you use the web-based application to add tasks and dependencies between each item using a drag and drop graphical interface. It also provides features to test the pipeline, preview results from a given task and fix errors (e.g. add missing tasks’ properties).
Let’s take a quick look at an example of Matillion architecture, applied to the Google Cloud Platform:
As you can see, an instance of Matillion ETL is running on Google Compute Engine service.
In this case REST and JDBC calls are used to Extract data from source systems. Raw data is saved to Google Cloud Storage service and then Loaded to BigQuery. Finally, the Transformation operations can be applied based on BigQuery features.
Matillion ETL in the web app provides many built-in functionalities to transform the data. It also allows you to build custom transformations, and every transform operation is performed using SQL syntax specific to the data destination you operate on.
Here are some possible use cases for this tool:
- A pipeline to get data from several instances of operational relational databases, clean it and provide merged datasets available for analytics in your cloud data warehouse
- Extracting JSON documents from MongoDB instance, loading it to data warehouse and applying transformations to denormalize this semi-structured data
- Extracting weather forecasts from API and populating a dedicated table in the data warehouse to keep historical forecasts, which are then used to train and validate a machine-learning model
- Getting data from platforms like Mixpanel, Twillo, Shopify, etc., to build custom analytics.
Matillion pricing model
Matillion has a credit-based consumption model related to edition type. Prices start at: $2.00 (Basic) / $2.20 (Advanced) / $2.30 (Enterprise) per 1 Credit = 1 Virtual Core Hour.
This pricing model might not sound straightforward, so how does it affect your budget?
Let’s assume that for the initial set-up you require two environments (DEV and PROD), which means you need 2 Matillion instances. DEV instance is running on average 8 hours per day (for development on business days) and PROD is running 24/7. Most of Matillion users are using 2 vCPUs VMs.
Credits required per instance per year = Number of Virtual Cores * daily usage * days.
In the case of our example, we have:
- DEV = 2 (cores) * 8 (development at business hours) * 252 (working days) = 4,032 credits
- PROD = 2 (cores) * 24 (instance running all day) * 365 (calendar days) = 17,520 credits
It means that, for a year of the Basic Edition, you pay $2*(4,032+17,520) = $43,104 in total for DEV and PROD.By default, there is no auto suspend feature (turning off Matillion instance when it’s not used), but there are at least 3 ways to implement a mechanism that will stop the virtual machine on which Matillion is running. Assuming that your data engineering or DevOps team will provide you with such a solution, and it turns out your PROD instance only needs to be up for scheduled jobs for 10 hours per day, costs can be reduced to $2*(4,032+17,520*10/24) = $22,664 for a year of Basic edition.
Please keep in mind that if 2 cores per instance aren’t enough (e.g. 4 will be required) then you will need to recalculate cost estimation. Upgrading the plan to one providing more features than Basic, or adding additional instances, will increase costs as well. You need to remember that Matillion is hosted on your existing cloud infrastructure, which in itself incurs costs.
For an organization requiring multiple heavy workloads, it can be normal for total costs to reach $100,000. We recommend first testing the tool and running some representative PoC pipelines, and then calculating an estimated cost with your data engineering team. While you go through your estimate, you can benefit from a 14-day free trial.
More expensive plans provide useful features like concurrent connections with a warehouse (Advanced Edition or higher), and clustered environments for HA + Git repository (Enterprise only).
Each plan covers 5 users + an unlimited number of read-only users. Each additional developer user costs 50 credits per month, with no minimum or maximum sign-up durations, which means that if for 6 months you need 2 additional users for development, you will pay at least 6 * 50 * 2 * $2 = $1,200.
Pros of Matillion
The tool offers a selection of unique benefits that may suit your business. Regardless of the fact that you can set up Matillion ETL instance with clicks or Cloud Formation Template, which should be natural for this kind of product, the advantages of Matillion are:
- User-friendly drag & drop interface for building pipelines without heavy technical skills
- It’s a self-hosted data ETL data pipeline tool, which guarantees data doesn’t leave the cloud infrastructure you manage. If your company data has to follow some strict compliance rules, this might already be essential.
- It supports more than 100 data source systems including PaaS, SaaS and digital platform connectors
- CDC (change data capture - only for AWS hosting) for RDBMS and data replication are supported
- Adding your own custom data source using Wizard-Driven Universal Connector framework
- It provides direct SQL query processing option for data migration of existing solutions
- 5 developer users by default and unlimited number of read-only users.
Cons of Matillion
Matillion is generally an easy-to-use ETL tool, with few drawbacks. The pricing structure may seem complex at first, but can be quickly picked up. The downsides of the tool are:
- Might be hard to estimate costs accurately without POC runs
- If you’re multi-cloud, you might have to pay for several Matillion instances
- Number of supported destinations is limited because it focuses on cloud data warehouses/similar analytical platforms
- Each additional developer user costs 50 credits per month (i.e. at least $100).
The second tool is Alteryx Designer (desktop application), which provides several options for ETL data pipeline automation. The first one is a drag & drop interface that allows data scientists and analysts to build ETL pipelines, as well as run required jobs, which shortens the time between indicating data sources and getting insights.
Alteryx provides a list of supported data integrations, with around 60 data sources and 50 destinations. The list is lengthy and covers common in-use cloud data warehouses, cloud object storage, SAP HANA, and Spark, as well as analytics tools such as Google Analytics, Power BI, Qlik, and Tableau. ODBC connector is provided, but JDBC is not supported.
By using Alteryx designer, you can build ETL, ELT, and other types of data pipelines, which can consist of more than 300 building blocks for a single workflow. The data is loaded to the instance you work on, and then you can apply a wide range of data transformations, including ‘clean’, ‘parse’, ‘join’, etc. Geospatial functions, data investigation (e.g. profiling, histograms), time-series, and predictive models (e.g. gradient boosting, neural networks) are supported as well.
Alteryx also offers in-database processing, which means operations on data can be performed without moving it out of its database. This feature can provide significant performance improvements as it skips this time-consuming and intensive step.
Many other functionalities are covered, such as sampling, reporting (including generating reports in different file formats), adding elements of a user interface (e.g. text box, drop-down, error message), A/B testing, and running Python scripts. For a full list of possibilities, take a look at Alteryx documentation.
Exemplary use cases for Alteryx Designer include:
- Extracting data from several data sources, preparing data cleaning, and training models using a neural network
- Getting data from Google Sheets and SAP HANA, preparing data for your BI dashboard, and converting it to Tableau extract
- Preparing a data pipeline for reports together with a user interface for generating them
- Many other use cases that are similar to those described for Matillion ETL.
Alteryx pricing model
In terms of pricing, Alteryx Designer is offered as a desktop application, with a basic fee of $5,195 per individual user per year. If you want to share analytical apps, centralize workflows to schedule and run them on a server available for many users, you will need to purchase the Alteryx Server.
The price of Alteryx Server is not displayed online (you will need to contact the provider), but some sources estimate costs of $80,000 per year. The price will differ on how many computing resources you decide to purchase. If you want to test Alteryx Designer first, you can get a 30-day trial license.
Pros of Alteryx Designer
Alteryx is a quick, easy-to-use option for building successful ETL data pipelines without requiring much coding knowledge, and without sacrificing versatility. It offers:
- User-friendly drag & drop interface for building pipelines without heavy technical skills
- Wide range of supported data destinations
- building ETL, ELT, and other data pipelines
- Platform not only for pipelines, but for analytics and reporting as well
- Some ready to use prediction models are available as well.
Cons of Alteryx Designer
The major drawbacks of Alteryx Designer are the limitations in terms of local hosting and restricted operating systems. For many businesses, these drawbacks may mean that Alteryx will fit more seamlessly within their current set-ups:
- Natively supported only on Windows
- The tool is provided as a desktop application and workflows are executed locally, which we don’t recommend for scheduled production pipelines
- Solutions can be hosted on the cloud only using Alteryx Server, which is purchased separately (however, it’s required for pipelines from the above-mentioned data flows)
- Alteryx Server is not hosted on your cloud resources, so compliance requirements at your company should be checked first.
Matillion ETL and Alteryx comparison
In the case of both ETL data pipeline tools, the range of supported integrations is quite long. Matillion ETL is focused mainly on cloud data warehouses, whereas Alteryx offers the additional possibilities of building analyses, machine learning models, providing reporting and Business Intelligence capabilities.
Infrastructure is a key difference between both tools. Matillion is based on owned cloud resources, whereas Alteryx Designer is run locally, but with the possibility of moving all computing to the Alteryx Server cloud. Matillion offers additional options to build some pipelines more effectively, e.g. CDC and data replication.
In terms of pricing, Alteryx Designer has a simple license per-user model, but you need to ask their consultants for Alteryx Server pricing. Matillion ETL model seems to be more complex, but it’s possible to estimate costs using ballpark figures.
Regardless of the differences, these tools fulfill their basic role: building ELT/ETL data pipelines in a visualized way, with ready to use components, commonly used system integrations and without deep data engineering knowledge. To make it even easier, Matillion ETL has an official community similar to this provided by Alteryx.
Process your data efficiently and at speed with ETL pipelines
ETL data pipeline tools are becoming a significant part of the data-driven approach to business. They allow data analysts, scientists and other data consumers to build pipelines and create end-to-end solutions on their own to shorten time to value. Data engineers can benefit from predefined features as well.
Data workflows and transformations for common in-use data sources can be automated easier with either a no- or low-code approach. Both tools described above provide a user-friendly drag and drop interface, with many kinds of building blocks provided by default.
ETL pipeline tools can boost your business productivity, but only if used efficiently. Inexperienced users can build inefficient pipelines, which will increase costs or overload source systems instead of streamlining data processing. Examples of this include failing to set up incremental loading for batch processing, not using data partitioning, forgetting about proper monitoring and alerting, or multiplication of the same data extraction in different pipelines.
For this reason, data administrators and data engineers should be on-hand throughout the pipeline-development process. This is particularly crucial for mission-critical or complex data pipelines that affect the performance of operational systems. Someone keeping high-level track of your data flows can reduce the number of unnecessary further processing.
We should also bear in mind that, before choosing a given tool, the best way to test it is to spend some time experimenting with, exploring, and doing PoC work with the tool (which is possible with trial versions). It should also give you a better insight into required resources and allow you to prepare initial cost estimates. Then you can compare and weigh up the costs, pros, and cons of ETL pipeline tools against custom data solutions that can be built alternatively without them.