Snowflake is a great data warehouse for managing data at scale. But what do you do when you need to work with your data in near real time? One possible solution is to use Snowpipe, Snowflake's data streaming feature.
With 2022 coming to an end, it's a fair assumption to say that data is everywhere. Almost every device collects some kind of information, processes it, and eventually sends it somewhere for further manipulation: our TV sets, cellphones, and smart home devices, let alone anything we do on the Internet, including our social media.
That’s why data processing tools have been becoming more and more comprehensive in order to match the always-evolving engineering needs. To keep up with the pace of change, companies often provide services that they wouldn’t bet on in the first place. They don’t want to be left behind and therefore create features to fit as many engineers as they can.
In this article, let’s dive into one of those situations and discuss Snowflake’s streaming functionalities included in this cloud data warehouse solution.
What is streaming anyway?
Let’s start with the opposite of streaming – batch processing – as it’s more natural to think of data processing this way.
We have, for instance, a specific set of files or tables stored somewhere and want to apply some transformations to the data, move it elsewhere, or simply add even more data. Data is loaded in bulk, or in a few batches, processed, and saved again. Pretty straightforward.
But what if the data flow needs to operate in a real-time or near real-time manner? That’s where streaming data comes in.
As mentioned above, streaming is a concept of data processing that allows for continuous intake and parallel computing from multiple sources (e.g., API calls, user session data, IoT sensors).
Because of the different architecture of the solutions, much smaller volume (ca. a few kilobytes per message), and distinct purposes, it enables bringing insights even in milliseconds and enhances real-time analytics.
Several frameworks have been developed to provide convenient processing tools that allow you to stream data like Apache Kafka, Apache Spark Streaming, and Apache Flink, just to name a few.
They undeniably do what they were designed to do. Nevertheless, as engineers’ needs become more sophisticated each day, sometimes it turns out that a perfect streaming solution may not be enough. A demand for a data warehouse, analytical processing, or convenient time travel can quickly arise right next to it.
What is Snowflake?
Snowflake is a cloud data platform that offers data management simplicity. All the services are fully-managed, and there is no need to worry about hardware or software maintenance.
The whole Snowflake architecture consists of 3 layers:
- Snowflake database storage layer
- Query processing layer
- Cloud services layer
The most important is that storage and processing layers are fully decoupled. The concept of virtual warehouses is used, and basically means computing resources that can be utilized between multiple teams or even tasks. The only condition is that a user scheduling commands has access to particular objects in the Snowflake data warehouse, that’s all.
Apart from that powerful distinction between processing and storage, Snowflake also has multiple advantages compared to other solutions in the market. Some of those can be found below:
- Ease of use and full automation
- Extra features, like time travel, zero-copy data sharing, or metadata management
- Security and encryption
- Cost management
- Comprehensive flexibility
If any of those features sound like a huge benefit for you, you should consider Snowflake as your potential data warehouse solution.
Streaming with Snowpipes in Snowflake - a compelling crossover
After a brief discussion above, Snowflake seems to be a clear example of a data warehouse oriented more towards analytical tasks across multiple teams. That’s correct since this tool has been designed mainly for OLAP (Online Analytical Processing) workloads.
Nevertheless, as data has already become ubiquitous, tools like Snowflake have to be prepared for pivots to provide new types of functionalities.
That’s why, in December 2018, Snowpipe, a Snowflake streaming-like feature was officially released. And even though it doesn’t allow integration directly with Apache Kafka data streams, it still enables near real-time processing of available data.
According to Snowflake documentation, "Snowpipe enables loading data from files as soon as they’re available in a stage. This means you can load data from files in micro-batches, making it available to users within minutes, rather than manually executing COPY statements on a schedule to load larger batches."
Executing possibly thousands of COPY statements (to load data into Snowflake) is not the only difference from a standard bulk-batch data load. In the documentation we can also find a distinction in the following areas:
- Security: when calling REST endpoints, Snowpipes require a key pair authentication with a JSON Web Token
- Load history: stored in the metadata of the pipe for only 14 days (compared to 64 days for bulk ingestion)
- Transactions: loads are combined or split into single or multiple transactions based on the number and size of the rows in each data file (whereas bulk loads are always a single transaction)
- Copy resources: Snowpipe uses compute resources provided by Snowflake (no need to think about virtual warehouse setup)
- Cost: billed only for Snowflake’s compute resources used to load data
If a company collects data from different sources frequently, Snowpipe seems like a great fit to stay within our data warehouse solution.
Snowpipe in practice
In terms of usage difficulty, running a Snowpipe isn’t that different from plain COPY statements that are used on a daily basis in batch data.
Apart from some specific parameters, like
AWS_SNS_TOPIC, it is a
COPY statement, which is passed at the end of a pipe definition. After creating one, Snowflake continuously monitors a staging area where new data can be ingested and processed with a brand new Snowpipe.
The syntax of a Snowpipe for AWS can be found below:
CREATE PIPE mypipe_s3 AUTO_INGEST = true AWS SNS TOPIC = 'arn:aws:sns:us-west-2:001234567890:s3_mybucket' AS COPY INTO snowpipe_db.public.mytable FROM @snowpipe_db.public.mystage FILE_FORMAT = (type = 'JSON');
Streams vs. Snowpipe in Snowflake
What can be misleading at the first glance is that, apart from Snowpipes, Snowflake Streams also refer to streaming features, especially one called Change Data Capture (CDC). When established, they monitor DML operations on a table or a view. It can be interpreted as table versioning – keeping up with all of the transactions performed.
The idea behind the solution is quite straightforward. A stream, as a Snowflake object, takes a snapshot of data at some specific point in time. It’s called offset and is simply a current version of the source object. Then, the implemented change tracking system records information about all DML transactions that happen since that offset.
Apart from the data itself, it also collects information about the type of the operation – whether it was an insert, update, or delete. Thanks to that metadata, users can properly react with some further processes, like updates in other tables or running specific functions.
Not so obvious way for streaming in Snowflake
In this article, we discussed an unpopular approach to connecting data warehouses with streaming functionalities. Even though it’s not a natural application of a data warehouse, Snowpipes and Streams have a clear purpose in a Snowflake ecosystem.
Catering to the market’s demand, they perfectly fill the gap for near real-time data ingestion inside a tool focused more on OLAP workloads. With data increasing in volume and velocity every day, it can turn out to be a part of production pipelines quite soon.