The Hitchhiker’s Guide to the Modern Data Stack

on 27.02.2024 by Kirsten Hipolito


Hello, weary traveler. You’ve likely read about the modern data stack, a set of components to collect and analyze data, optimized for maximum accessibility, scalability, maintainability, and robustness. 

In this article, we’ll outline four key factors you should consider when selecting the tools in your modern data stack: your people, your data use cases, your budget, and your future. 

First things first: What is a modern data stack anyway? 

The idea of a modern data stack is that you can set up an easy-to-use and easy-to-maintain stack of tools that support a wide range of use cases, without necessarily needing a dedicated data team, a corporate-level budget, or a paranoid android to do all the calculations for you. 

As with any data stack, this includes tools for data ingestion, data storage or warehousing, data transformation, and a business intelligence/visualization layer. Beyond this minimal setup, there are many other tools available that can be added based on your company’s needs, such as data cataloging, data governance, and orchestration. 

The components of a modern data stack: data sources, data ingestion, data storage/data warehouse, data transformation, and visualization.

The sheer number of different tools available for each component, let alone the entire stack, can make the decision process feel overwhelming. Though most popular tools fit the criteria of a modern data stack, they differ in terms of which use cases and users/stakeholders they are best suited to.

If you want to learn more about specific use cases like visualizing marketing campaign performance or other case studies, you can check out our project case studies here.

To help you make your choice, let’s dive into the four key factors: 

Your people: At the heart of your data stack choice 

People are the heart that keeps everything running, so it follows that they are central to your decision. Here are a few key questions you can ask yourself to help identify the right tools for your needs. 

Who are the main stakeholders using the insights? 

Consider their technical skills, data maturity, the tools they’re familiar with, and their time constraints and willingness to adopt a new system. 

The data maturity of your stakeholders (or end users) should factor into your choice of visualization tool in particular; out of your whole stack, it’s the tool they’ll interact with the most. If your stakeholders aren’t technical, dashboards built using more familiar tools like Tableau or Power BI may be sufficient, and easier to onboard new people to. 

More data-savvy stakeholders, however, will likely appreciate the flexibility of self-service analytics tools like ThoughtSpot, Veezoo, or even Hex for those who are hands-on. 

Who will be maintaining the data and its associated pipelines? 

The overhead work and maintenance for an interconnected stack of tools can be managed in different ways. 

If you already have people in your team with technical skills and experience, such as a fully-fledged in-house data team, you can set up and deploy the tools independently. This saves on infrastructure costs and gives you much greater flexibility and control over your setup. 

On the other hand, vertically integrated solutions like Y42 are a better fit for companies with small teams and limited bandwidth. Since most of the tools in the stack come pre-bundled and ‘co-managed’ in one platform, one or two people can maintain the data integration, storage, and transformation with relative ease. 

You also have the option of bringing consultants on board (like us!) to help you decide on and implement the stack, and to support you with maintenance. 

 

Learn more about our data product services here.

 

Your use cases 

As the hyper-intelligent mice found out with Deep Thought, knowing the right questions to ask your data is just as important as, if not more important than, the answer itself. So, think carefully about your use cases, or in other words, what you want to do with your data. 

How should your stakeholders interact with the data, and how often? 

Most companies start by creating dashboards summarizing key business metrics. These reports usually only need to be refreshed with new data a couple of times a day. A good example would be tracking the long-term performance of your social media campaigns across different platforms. 

This use case is already well served by established cloud tools: Airbyte or Fivetran for data integration, dbt for modeling, BigQuery or another cloud data warehouse for storage, and BI tools like Power BI or Tableau for visualization. 
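To make that concrete, here’s a minimal sketch in Python of the dashboard pattern: querying a cloud warehouse (BigQuery, in this example) for campaign metrics that a BI tool would then visualize. The project, table, and column names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()  # authenticates with your default GCP credentials

# Hypothetical table and columns; in practice, a tool like Airbyte or
# Fivetran lands this data and dbt models it before the BI layer queries it.
query = """
    SELECT campaign, DATE(event_ts) AS day, SUM(clicks) AS clicks
    FROM `my_project.marketing.campaign_events`
    GROUP BY campaign, day
    ORDER BY day DESC
"""

for row in client.query(query).result():
    print(row.campaign, row.day, row.clicks)
```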

How do you envision stakeholders utilizing the data further? 

Organizations with a higher data maturity level often look for ways to use their data beyond reporting metrics. One such use case would be sending modeled data from the data warehouse back to the business tools. For example, sending each customer’s purchase history to your customer service tool. This is usually referred to as data activation and is served by Reverse ETL tools such as Census. 
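To illustrate the shape of that flow, here’s a minimal reverse ETL sketch in Python: read modeled data from the warehouse and push it to a business tool’s REST API. A dedicated tool like Census handles the scheduling, batching, and retries for you; the warehouse table and the customer service tool’s endpoint below are hypothetical.

```python
import requests
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical modeled table produced by your transformation layer.
rows = client.query(
    "SELECT customer_id, lifetime_orders FROM `my_project.marts.customer_history`"
).result()

for row in rows:
    # Hypothetical endpoint; a real customer service tool has its own API.
    requests.post(
        "https://support-tool.example.com/api/customers/sync",
        json={"customer_id": row.customer_id, "lifetime_orders": row.lifetime_orders},
        timeout=10,
    )
```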

Other more complex use cases include real-time analytics, forecasting and prediction models, and AI applications, which all require either additional tools or frontend/backend development. 

 

Read our modern data stack case study.

 

Your budget: The right modern data stack doesn’t always mean spending a lot of money

Arguably the biggest factor for many companies is the cost, which then leads to the question: 

How much money are you willing to spend, and keep spending, on your data stack? 

Data comes at a cost. The cost visible upfront is that of the initial setup and implementation, but running pipelines and consuming data bring recurring costs no matter which setup you go with. You’ll need to pay either subscription fees for cloud tools or the costs of running your own servers. 

Cloud platforms charge for data ingress and egress, storage, and compute capacity. This means that collecting, storing, modeling, and consuming your data are all billed, especially if your integrations and visualization tools aren’t native to the platform. 
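As a back-of-the-envelope illustration, here’s how those line items add up for a modest setup. The unit prices below are placeholders for the sake of the arithmetic, not current list prices; check your provider’s pricing page for real numbers.

```python
# Illustrative, assumed rates -- not actual vendor pricing.
STORAGE_USD_PER_GB_MONTH = 0.02   # assumed warehouse storage rate
QUERY_USD_PER_TB_SCANNED = 5.00   # assumed on-demand query rate

stored_gb = 500            # data held in the warehouse
scanned_tb_per_month = 3   # data scanned by models and dashboards each month

monthly_cost = (
    stored_gb * STORAGE_USD_PER_GB_MONTH
    + scanned_tb_per_month * QUERY_USD_PER_TB_SCANNED
)
print(f"Estimated monthly warehouse cost: ${monthly_cost:.2f}")  # -> $25.00
```

Note how compute (scanning) dominates storage here: that’s typical, which is why frequently refreshed dashboards and unoptimized models can become the biggest cost drivers.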

How much time are you willing to spend? 

As your business grows and your team adds tools to their workflow, keep in mind the ongoing maintenance cost of adding new data sources, fixing bugs, and adjusting business logic. 

Deploying the open-source version of tools on your own servers is cheaper and gives you more control over your data models and pipeline. However, it is more complex to implement and time-consuming to maintain than using the paid cloud-based version of these tools. It then becomes a question of time to market and opportunity cost. Ask yourself which is more important in your case: delivering new insights quicker, or minimizing the running costs? 

 

Your past/your future 

Finally, the infrastructure you already have, and what you anticipate needing in the future, should factor into where you choose to deploy your stack. 

What do you already have in place? 

If your organization already relies mainly on one cloud platform, your tool decisions are often narrowed down to those natively offered by, or compatible with, that cloud. 

If you don’t use a cloud platform, don’t panic! Implementing a modern data stack doesn’t necessarily require overhauling your legacy systems and moving everything to the cloud. 

There are strong arguments for cloud-based tools though, especially cloud-based data warehouses such as BigQuery or Snowflake: analytics workloads usually run faster on them than on traditional relational database management systems (RDBMS) like PostgreSQL. 

For high-volume data in the tens of millions of rows, such as event-level web analytics data, the difference in processing time could be minutes versus hours. However, with fine-tuning, configuration, and diligent database maintenance, traditional RDBMS running on your existing servers can also be fast while remaining cost-efficient. 

What tools will you need in the future? 

Whichever infrastructure you choose, it’s important to ensure your stack is future-proof. The main elements to consider are: 

Accessibility

You should own and have access to your data, and the pipelines and models associated with it. For this reason, no matter which specific tools you use, we always recommend having your own data warehouse. Whether it’s on-premise or in the cloud, ensure you have properly synced and tested backups. Avoid keeping your data only in vendor platforms that offer no SLAs on data retention. 
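As a sketch of what owning your backups can look like in practice, here’s a minimal Python example that exports a BigQuery table to a storage bucket you control. The table and bucket names are hypothetical, and in production you’d schedule and verify such exports rather than run them by hand.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical source table and destination bucket.
extract_job = client.extract_table(
    "my_project.marts.customer_history",
    "gs://my-backup-bucket/customer_history-*.avro",  # wildcard allows sharded files
    job_config=bigquery.ExtractJobConfig(destination_format="AVRO"),
)
extract_job.result()  # block until the export finishes
```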

Maintainability

You should be able to maintain the pipelines and models easily. Data modeling tools like dbt provide a SQL framework that standardizes data models and integrates with version control, such as a GitHub repository, encouraging traceability and maintainability. 

Scalability

As your organization grows, the tools in your modern data stack must be able to scale with you. They should also be interoperable: if a tool no longer serves your needs, you should be able to replace it without rebuilding the whole stack from scratch. 

 

Recap

Ultimately, your choice of tools is an important decision that should not be taken lightly. It’s a decision you and your team will live with, and pay for, for years to come. Before committing, use these four factors to clarify what your organization needs and should build for: 

  • Your people 
  • Your use cases 
  • Your budget 
  • Your past/your future 

 

If you would like some support with building a modern data stack, feel free to reach out to us for a chat. 
