12 Steps to Build a Successful Clickstream Analytics Pipeline

Krystian Dziubiński

Oct 6, 2021 • 16 min read

Which and how many pages does a website visitor click on, and in what order? Clickstream data analysis holds the key. Using clickstream analytics to collect, analyze, and report aggregated data tells you the path a customer takes – the clickstream.

E-commerce-based analysis highlights that 98% of e-commerce customers don’t go ahead with a purchase on their first visit to a website or a mobile app. To build a picture of the WHY behind that, clickstream analytics comes into play. Retail clickstream data is especially important for gaining a holistic view of user behavior, customer experience, and the customer journey, and for personalizing the experience via customer segmentation.

Clickstream data records every click a potential customer makes while browsing a website. Tracking these digital touchpoints helps companies understand better how a user moves during a consumer journey. Clickstream uses big data to build a pipeline. How exactly? Read on to find out the intricacies.

Before we move on, it’s important to keep in mind that the landscape of technology frameworks, tools, and methodologies available for building clickstream pipelines is huge. Instead of covering everything, this article breaks down the major clickstream data analysis components, outlining vital considerations for designing a real-time or close to real-time clickstream pipeline.

Clickstream analytics pipeline architecture

Clickstream data allows online companies to track consumer behavior and offer a personalized user experience. To facilitate that, clickstream analysis is built using five major components:

  • Data source
  • Data processing
  • Storage
  • Analytics
  • Visualization

Each step in the data flow has its own characteristics and pitfalls to avoid. We’ll describe them in the following sections but, in general, there are three factors to keep in mind when you build a clickstream analysis pipeline:

  • Scalability. A system that can manage and store raw clickstream data without increasing costs proportionately.
  • Flexibility. If data structure changes, ideally the pipeline shouldn’t need updating.
  • Safety. If failure occurs, the pipeline upstream is equipped to deal with that.

1. Identify good data sources

Before building a pipeline, consider which big data sources are best for clickstream analytics. To do that, look at the characteristics of the data sources.

By definition, clickstream is a stream of clicks and other user interactions with frontend applications. Therefore, frontend data is the first requirement. Let’s dig into that a bit more.

Front end vs back end as a data source

With the data collection process, it’s common for backend services to produce, store, and/or forward data into data processing pipelines. Backend as a data source comes with several benefits for pipelines, such as:

  • Early data organization and validation
  • Security layer against hacker attacks
  • Lack of frontend user access for the data creation process
  • More limited user rights to tracking policies

However, clickstream data analysis is not about consistency and quality of individual events. Instead, it’s about insights into customer trends, user visits and behaviour discovery, and overall performance reporting. Data quality isn’t that important, unlike operational data use cases, where it's all about precision in business operational efficiency.

Clickstream uses customer trends, volume, and understanding of top events, rather than edge cases and individual instances of behaviour. Therefore, it makes sense to avoid the backend layer and go directly from the frontend into the data clickstream pipeline.

That opens the door to users who want to skew the data. However, it’s possible to implement anti-bot techniques to protect data quality.

A few additional records manually generated by users aren’t a problem, as clickstream’s main point of interest is volume and trends. Likewise, it’s relatively easy and cheap to create an API for the frontend to send data directly to the big data pipeline. Because it’s one-way communication, the request and response are less complex; the response is often a single pixel (hence the term pixel tracking).
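To make the pixel idea concrete, here is a minimal sketch in Python of what the receiving side might do with such a request. The URL, the parameter names, and the function name handle_pixel_request are assumptions for illustration only; a real endpoint would sit behind a web server and add validation and anti-bot checks.

```python
import base64
from urllib.parse import urlparse, parse_qs

# A commonly used transparent 1x1 GIF, returned as the response body.
PIXEL_GIF = base64.b64decode(
    "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
)


def handle_pixel_request(url: str) -> tuple[dict, bytes]:
    """Turn the query string of a pixel request into an event.

    The frontend only receives the image; nothing else is sent back.
    """
    query = parse_qs(urlparse(url).query)
    # parse_qs returns a list per key; keep the first value of each.
    event = {key: values[0] for key, values in query.items()}
    return event, PIXEL_GIF


# Hypothetical request URL produced by the frontend.
event, body = handle_pixel_request(
    "https://tracking.example.com/pixel.gif?event=page_load&page=%2Fcheckout"
)
print(event)  # {'event': 'page_load', 'page': '/checkout'}
```

The event travels entirely in the request, so the frontend never has to wait for or interpret a meaningful response.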

Clickstream analytics event types

A good data source produces behavioral data about many front end interactions, including:

  • Page loads
  • Clicks
  • Hovers
  • Scrolls
  • Keyboard typing

Page loads and clicks

From a web analytics standpoint, page loads and clicks are two highly distinguishable types of data, relaying different information about user behavior.

Although page loads are not interactions per se, they indicate an important process – clicking on a link with redirection to either an outbound page or a subpage. That distinction makes the user journey much easier to understand in the clickstream analysis process.

It’s possible to omit page loads by tracking users clicking on a link on the first page before redirection, but generally, it’s good practice to track that interaction on the page load after redirection.

That comes down to a few factors. Firstly, in a multi-page application, following a link stops all running scripts so the next page can start loading. An unguarded click event call may therefore be cut off before the network call to the data pipeline endpoint is made, and the event is lost.

Secondly, the loaded page knows what the user has actually seen. That means events can carry not only content impression context, but also additional information such as page hierarchy, metadata, and page load timings.

Hovers and scrolls

Hovers and scrolls are rarer event types in clickstream analytics. They’re mostly used in UI/UX profiling to produce detailed user journeys and heatmaps. With hovers and scrolls, it’s all about grouping the data: either packaging events into micro-batches, or constructing events so that each network call carries information about more than one hover or scroll.

That must happen before the data is sent to the clickstream pipeline, because traffic can reach extremely high volumes, resulting in expensive computation.

Some user types may produce a huge number of these events because of dynamic interactions with the front end, so packaging these events is crucial for cost as well as performance.
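The packaging step could look roughly like the sketch below. It is written in Python for consistency with the other examples in this article, although in practice this logic would live in the frontend. The class name ScrollBatcher, the event shape, and the batch size are illustrative assumptions.

```python
from typing import Optional


class ScrollBatcher:
    """Collects scroll depths and emits them as one event per batch."""

    def __init__(self, max_size: int = 20) -> None:
        self.max_size = max_size  # assumed threshold; tune to traffic and cost
        self._buffer: list[int] = []

    def add(self, scroll_depth: int) -> Optional[dict]:
        """Buffer one scroll reading; return a batch event once full."""
        self._buffer.append(scroll_depth)
        if len(self._buffer) >= self.max_size:
            return self.flush()
        return None

    def flush(self) -> Optional[dict]:
        """Emit whatever is buffered, e.g. when the user leaves the page."""
        if not self._buffer:
            return None
        event = {"type": "scroll_batch", "depths": self._buffer}
        self._buffer = []
        return event


batcher = ScrollBatcher(max_size=3)
results = [batcher.add(depth) for depth in (10, 35, 60, 80)]
# Only the third call returns an event, carrying depths [10, 35, 60].
leftover = batcher.flush()  # carries the remaining depth [80]
```

Four scroll readings become two network calls instead of four, and the ratio improves as the batch size grows.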

Keyboard typing and clickstream data analysis

When it comes to tracking user data from text boxes, the most important aspects are data security, cleansing, and quality. Users can type anything, from Chinese characters, emoticons and passwords, to card details. Simple search box data may contain all of these examples, even though it’s only expecting to detect search-related text.

It’s also important to remember that this kind of data is usually produced on the next page to minimize noise, capturing only the final text entered in the text box. In the search box example, the text the user typed is sent after the user clicks the search button and is successfully redirected to the next page.
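As a rough illustration of the cleansing step, the Python sketch below masks card-like digit sequences, drops control characters, and caps the length of a search phrase. The pattern, the length limit, and the function name are assumptions; this does not detect passwords or other secrets, and it is not a substitute for a proper data security review.

```python
import re

# 13 to 19 digits, optionally separated by spaces or dashes (card-like).
CARD_LIKE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
MAX_LENGTH = 200  # assumed upper bound for a search phrase


def sanitize_search_text(text: str) -> str:
    """Mask card-like numbers, drop control characters, cap the length."""
    text = CARD_LIKE.sub("[REDACTED]", text)
    # Keep printable characters only; non-Latin scripts and emoji survive.
    text = "".join(ch for ch in text if ch.isprintable())
    return text.strip()[:MAX_LENGTH]


print(sanitize_search_text("red shoes"))            # red shoes
print(sanitize_search_text("4111 1111 1111 1111"))  # [REDACTED]
```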

2. Data - the most important layer

Regardless of whether the data source is from the front or backend, a contract between the data source and data pipeline is highly recommended, in the form of a single structured data object called a data layer.

That way, the consumer (data pipeline) knows what data types and structure to expect, and the producer knows what data to send downstream. This contract is the most vital part of clickstream and data streaming.

There’s significantly more reverse engineering and data analysis required to correct data downstream as opposed to upstream.

When source data malfunctions, it takes a lot of work to identify and fully understand the issue at the presentation layer. By the end of the data pipeline, the data has already been processed, which makes it even harder to pinpoint when and where in the process the issue arose.

Hence, clear requirements for developers at the source, in the form of a clear schema (in JSON, for example) and proper documentation, are a must.
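A contract check can be very small. The following Python sketch assumes a hypothetical page load contract (the field names and types are invented for illustration) and reports every violation, so that the producer and the pipeline can agree on what went wrong.

```python
# Hypothetical contract: field name -> expected Python type.
PAGE_LOAD_CONTRACT = {
    "event": str,
    "page": str,
    "session_id": str,
    "timestamp": int,  # Unix epoch in milliseconds (assumed convention)
}


def validate_event(event: dict, contract: dict) -> list[str]:
    """Return a list of contract violations; an empty list means valid."""
    errors = []
    for name, expected_type in contract.items():
        if name not in event:
            errors.append(f"missing field: {name}")
        elif not isinstance(event[name], expected_type):
            errors.append(
                f"wrong type for {name}: expected {expected_type.__name__}"
            )
    return errors


good = {
    "event": "page_load",
    "page": "/home",
    "session_id": "abc",
    "timestamp": 1633507200000,
}
bad = {"event": "page_load", "page": 42, "timestamp": "yesterday"}

print(validate_event(good, PAGE_LOAD_CONTRACT))  # []
print(validate_event(bad, PAGE_LOAD_CONTRACT))
```

In a real project, a schema language such as JSON Schema with a dedicated validator would likely replace this hand-rolled check.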

3. Scraping in real-time – a technical debt

If frontend developers aren't available, but clickstream data is needed, scraping may seem like an option. Scraping means developing external code that reads the frontend source code to create messages or events.

Unfortunately, that means the reading code must continuously adapt to the smallest change on the frontend application, and usually the scraping code has no visibility of frontend changes. Additionally, developers managing the scraping code are usually in a different team, division or department.

In practice, you should avoid real-time data scraping at all costs. If a website changes, the scraping code breaks, and there’s no data until it is corrected.

4. Consistent data processing

Errors, bugs, and failures are real-time issues in streaming or micro-batching pipelines. Resilience to failure is essential. Usually it’s better to proceed with incorrect data than no data at all.

How strict and consistent the data processing needs to be is an individual decision. The greater the pressure on data quality, the less fault-tolerant the data pipeline is. It’s advisable to move the most complex operations downstream, and focus on simpler, vital data preprocessing steps early in the data pipeline.
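One way to favor incorrect data over no data is to flag a failing event and pass it on instead of stopping the stream. The sketch below shows the idea in Python; the function names and the _processing_error field are hypothetical, and how flagged events are handled downstream is left open.

```python
from typing import Callable, Iterable, Iterator


def process_tolerantly(
    events: Iterable[dict], transform: Callable[[dict], dict]
) -> Iterator[dict]:
    """Apply transform to each event; on failure pass the event on, flagged."""
    for event in events:
        try:
            yield transform(event)
        except Exception as exc:  # broad on purpose: the stream must not stop
            yield {**event, "_processing_error": repr(exc)}


def add_price_in_cents(event: dict) -> dict:
    """Example transform that fails on malformed prices."""
    return {**event, "price_cents": round(float(event["price"]) * 100)}


events = [{"price": "19.99"}, {"price": "n/a"}, {"price": "5"}]
for result in process_tolerantly(events, add_price_in_cents):
    print(result)
```

The malformed second event still reaches the next step, carrying enough information to be repaired or filtered later.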

5. Position joins downstream

Joining and aggregating data sources in real-time is extremely difficult. To do so effectively, streaming and data processing applications must wait for data and calculate in the form of windows.

Windowing is possible in different ways; choosing the best method and implementing it are both difficult. The simplest way? Using fixed time windows and joining or aggregating the data every few seconds, but that delays the arrival of data in the presentation layer.

The best way is to position data aggregation or joining as a later operation in the stream, after data quality, security, and backup have been ensured.
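For reference, the fixed time window mentioned above boils down to snapping each timestamp to the start of its window. The Python sketch below applies that to an in-memory list, assuming events carry a ts field in seconds; a real streaming engine would also have to deal with late and out-of-order events, which this sketch ignores.

```python
from collections import Counter


def count_per_window(events: list[dict], window_seconds: int) -> dict[int, int]:
    """Count events per fixed (tumbling) window, keyed by window start time."""
    counts: Counter = Counter()
    for event in events:
        # Integer division snaps each timestamp to the start of its window.
        window_start = event["ts"] // window_seconds * window_seconds
        counts[window_start] += 1
    return dict(counts)


clicks = [{"ts": 100}, {"ts": 103}, {"ts": 104}, {"ts": 105}, {"ts": 111}]
print(count_per_window(clicks, window_seconds=5))
# {100: 3, 105: 1, 110: 1}
```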

6. Ensure separation of concerns

As mentioned, clickstream data is volatile and sensitive. Designing the stream pipeline as a microservices architecture with proper separation of concerns is the way to go.

For example, keep data security services, such as hashing and encryption, in more restricted accounts, where developers get time-limited access that is easy to request. That way, you keep track of who accesses the most sensitive part of the data stream, and when.

Separation of concerns is achieved by building microservices that are virtually black boxes for each other. The first microservice carries out its job and pushes data to the next one, with neither knowing how the other works. The result is a chain of black boxes that is much easier and safer to fix, monitor, and develop.
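The chain of black boxes can be pictured as a list of stages that share only an input and output shape. In the Python sketch below, each function stands in for a separate microservice; the stage names and event fields are assumptions, and the plain SHA-256 hash is only a placeholder for whatever hashing or encryption the security service really applies.

```python
import hashlib
from typing import Callable

Stage = Callable[[dict], dict]


def hash_user_id(event: dict) -> dict:
    """Security stage: replace the raw user ID with a SHA-256 digest."""
    digest = hashlib.sha256(event["user_id"].encode("utf-8")).hexdigest()
    return {**event, "user_id": digest}


def add_page_section(event: dict) -> dict:
    """Enrichment stage: derive the top-level section from the page path."""
    section = event["page"].strip("/").split("/")[0] or "home"
    return {**event, "section": section}


def run_pipeline(event: dict, stages: list[Stage]) -> dict:
    """Pass the event through each stage in order."""
    for stage in stages:
        event = stage(event)
    return event


raw = {"user_id": "user-123", "page": "/shop/shoes"}
processed = run_pipeline(raw, [hash_user_id, add_page_section])
print(processed["section"])  # shop
```

Because each stage only sees a dictionary in and a dictionary out, any one of them can be replaced, monitored, or locked down without touching the others.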

7. Resilient data backup

When working in real-time, clickstream data processing pipelines deal with a tremendous number of messages. The messages are stored at each processing step so that the least amount of reprocessing is required if business data rules change, or if data is in an incorrect state or of poor quality.

Therefore, the process of backing up and archiving clickstream analytics data must be cheap but also fast in retrieval, in case reprocessing is required. The start of the streaming pipeline includes processes ensuring hashing and encryption prior to any further processing.

Such raw data is usually stored in distributed object stores such as Google Cloud Storage, AWS S3, or Azure Blob Storage. In that form, the data is easily moved into “colder”, cheaper storage once it’s no longer used frequently. These services also offer high resiliency, so data loss is very unlikely.

8. File size does matter

Each message must be stored in a file-like object that enforces metadata, so that each object has a path, size, and format. Size here plays an important role. The worst mistake anyone can make when backing up clickstream analytics data is saving one message per file-like object.

That generates a tremendous number of objects and metadata, drastically slowing down reads and updates during data processing. Instead, it’s advisable to micro-batch messages before backing them up, ideally into objects of a few MB each. That way, reading the stored data is much faster and cheaper.
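A size-based micro-batch might be sketched as follows in Python. Messages are serialized as newline-delimited JSON, which is one common choice rather than a requirement, and the 4 KB limit is only there to keep the example small; in practice the limit would be a few MB.

```python
import json
from typing import Iterable, Iterator


def batch_by_size(messages: Iterable[dict], max_bytes: int) -> Iterator[bytes]:
    """Group messages into newline-delimited JSON blobs of up to max_bytes."""
    lines: list[bytes] = []
    size = 0
    for message in messages:
        line = json.dumps(message).encode("utf-8") + b"\n"
        # Start a new object when adding this line would exceed the limit.
        if lines and size + len(line) > max_bytes:
            yield b"".join(lines)
            lines, size = [], 0
        lines.append(line)
        size += len(line)
    if lines:
        yield b"".join(lines)


messages = [{"event": "click", "n": i} for i in range(1000)]
blobs = list(batch_by_size(messages, max_bytes=4096))
print(len(blobs), "objects instead of", len(messages))
```

Each blob would then be written as a single object to the backup store. A streaming pipeline would usually also flush on a timer, so that quiet periods don't hold messages back indefinitely.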

9. Pay attention to good partitioning

Often, data is stored in a folder-like structure called partitions. Each partition has its technical limitations. By dividing data into partitions, you allow for distributed storage and asynchronous data processing, without hitting the storage service limits.

Partitioning also allows for a logical composition of the data. When reprocessing or updating data, it helps select only the parts of the data that the specific processing requires.

The most common partitioning in clickstream analytics is by time, using a year/month/day/hour/minute structure. That’s the simplest and most manageable way to partition, because clickstream data is usually framed into time bounds as one of the first filters.
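As an illustration, a time-based partition path could be derived from the event timestamp like this. The key=value folder naming and the clickstream prefix are assumptions (a convention many query engines understand), not something the storage services require.

```python
from datetime import datetime, timezone


def partition_path(timestamp: datetime, prefix: str = "clickstream") -> str:
    """Build a year/month/day/hour/minute object path prefix in UTC."""
    ts = timestamp.astimezone(timezone.utc)
    return (
        f"{prefix}/year={ts:%Y}/month={ts:%m}/day={ts:%d}"
        f"/hour={ts:%H}/minute={ts:%M}/"
    )


event_time = datetime(2021, 10, 6, 14, 5, tzinfo=timezone.utc)
print(partition_path(event_time))
# clickstream/year=2021/month=10/day=06/hour=14/minute=05/
```

Normalizing to UTC before building the path keeps partitions consistent when events arrive from users in different time zones.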

Another option is to partition by user and session. While that structure helps data retrieval at the user level, it’s significantly more difficult to make sense of the data when timeframe retrieval is required.

10. Focus on data organization and reporting

At the end of the clickstream analytics data pipeline, normalization is required to minimize data redundancy and make sense of the data. The data must be prepared in the form of views on top of data tables, but before that happens, it must be organized in a Raw Data Vault, where data is normalized into satellites, hubs, and links.

With data organization, we ensure a simple way of preparing views that help in building data visualization platforms and creating seamless real-time dashboards.

11. Balance between normalized and denormalized data

Clickstream data is extremely difficult to properly normalize in the data warehouse. That’s because it is characterized by a huge number of dimensions, resulting in large tables with a large number of columns. Also, the clickstream isn’t reliable enough to consistently normalize the data with other data sources.

Product ID, revenue, or basket size and value aren’t necessarily accurate compared to data coming from backend systems, where operations are confirmed for business and fully functional.

Therefore, don’t put too much effort into data normalization and treat the clickstream data source as its own entity with unique dimensions that don’t necessarily join with other parts of the business.

For a good balance, keep link tables and hubs only for what’s required for the business at that current point, instead of trying to normalize every single dimension. That way, the data is accessible earlier. However, that only works with modern columnar storage warehouses that can scale easily, and where filtering by columns improves performance significantly.

12. Keep dashboards simple

Once the clickstream analytics data pipeline is processing, storing, and organizing data in data lakes/data warehouses in real-time or almost real-time, then it’s time for building the dashboards. The important thing is keeping the dashboards minimalistic and focused on specific parts of the customer journey on the front end app.

Trying to visualize the complete end-to-end customer journey is difficult, and there’s usually little interest in that sort of information. Instead, it’s better to specify key performance indicators for individual parts of the journey. These can take the simplest form, such as recent or average basket value, number of current users, or a simple funnel with a maximum of four steps to identify where customers struggle to proceed to conversion on the frontend app.
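A four-step funnel needs very little logic. The Python sketch below counts how many sessions reached each step, given that they also reached all earlier ones. The step names and event fields are hypothetical, and for simplicity the sketch ignores the order in which events occurred within a session.

```python
def funnel_counts(events: list[dict], steps: list[str]) -> list[int]:
    """Count sessions reaching each step, requiring all earlier steps too."""
    seen: dict[str, set] = {}
    for event in events:
        seen.setdefault(event["session"], set()).add(event["type"])

    counts = []
    for index in range(len(steps)):
        required = set(steps[: index + 1])
        counts.append(sum(1 for types in seen.values() if required <= types))
    return counts


steps = ["view_product", "add_to_basket", "checkout", "purchase"]
events = [
    {"session": "a", "type": "view_product"},
    {"session": "a", "type": "add_to_basket"},
    {"session": "a", "type": "checkout"},
    {"session": "a", "type": "purchase"},
    {"session": "b", "type": "view_product"},
    {"session": "b", "type": "add_to_basket"},
    {"session": "c", "type": "view_product"},
]
print(funnel_counts(events, steps))  # [3, 2, 1, 1]
```

The largest drop between two neighboring numbers points to the part of the journey where customers struggle most.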

Cultivating effective clickstream analytics

Real-time and close to real-time clickstream analytics are dependent on consistent, resilient, and use case-focused data processing, with much effort put into data source design and careful implementation at the source.

The data quality cascade is strong in real-time processing, so more effort at the beginning of the pipeline saves effort later on. In general, it’s not too difficult to build a clickstream pipeline for one frontend application.

The real challenge is to build multiple clickstream analytics systems for corporations that hold a number of frontend applications for different business units, serving multiple customers and their diverse needs. In this scenario, holding to the listed principles has a huge impact on the overall technical and analytical advances across the business.


Krystian Dziubiński

Krystian Dziubinski works as a Senior Data Engineer at Netguru.