When I joined a B2B SaaS startup in 2021, I was coming from a bigger, more mature company. There, I had worked on one particular area of the product. The work was rigorous and intense, but I was excited about the challenge of a bigger role at the new company, with the opportunity to lead data for sales, marketing, customer success, product, and even people.
However, what I did not anticipate was the lack of resources at a startup. In my first six months, we were a team of five: a small, lean team trying to explore data, build dashboards, and derive insights to set the direction of the product, all at the same time. I have always taken pride in building a partner team rather than a support team. But with so much unknown, so much discovery still to be done with the data, and people joining and leaving, it became difficult to understand where the data actually came from.
The more time my team and I spent on data discovery, the more we realized that we weren’t spending time on the prescriptive and predictive analytics we should have been doing; we were stuck in the descriptive loop. Hiring more people was not the solution, because they would simply join the chaos without a scalable solution in place. We kept asking the same questions about data definitions and data sources, and after some research we concluded that we were missing a data catalog tool. I had used Collibra in the past, but it required heavy manual effort and did not fit our needs at the time. So, we started looking for a modern data catalog.
Modern Tools, but Old Problem
If you walk into any modern data company, you will see a data ingestion tool that loads data into the warehouse, a BI tool that maintains multiple approved and exploratory dashboards, and maybe a reverse ETL tool that sends data back to the CRM. Modern solutions exist, and each of these tools does a wonderful job, but the complexity remains high, especially if you are trying to tie everything together.
Where the data comes from and how it is manipulated remains a question mark that gives data folks sleepless nights. Data dictionaries, definitions, and lineage remain trapped inside YAML files and Git pull requests. Data engineers understand this material, but most of the knowledge lives in their heads. The documentation that does exist is often too complex for other stakeholders, and sometimes the stakeholders are simply not motivated enough to read through it.
Choosing the Right Tool
There were multiple data cataloging tools available, and we selected ours against the criteria below. Although tools market a number of fancy features, we just needed some basic features that “actually” resolved our problem:
Automatic integration with dbt and the warehouse
Accurate column-level lineage
Searching capability, be it at a column level or table level
Useful metadata
Multiple tools fit the bill to some extent. But we wanted a tool that sits on top of our warehouse and is smart enough to understand lineage automatically. It should help us trace data back to its source and also show which dashboards consume each field, so a stakeholder looking at a dashboard knows the business logic behind the field they are referring to.
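To make the lineage requirement concrete, here is a minimal sketch of the question we wanted a catalog to answer: given a column, which dashboards end up consuming it? This is not how any particular catalog is implemented. The edge list, the column and dashboard names, and the `dashboard:` prefix are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges as (upstream, downstream) pairs.
# A real catalog would derive these from the warehouse, dbt, and the BI tool.
EDGES = [
    ("raw.crm_accounts.arr", "analytics.accounts.annual_recurring_revenue"),
    ("analytics.accounts.annual_recurring_revenue", "dashboard:Revenue Overview"),
    ("analytics.accounts.annual_recurring_revenue", "analytics.account_health.arr_band"),
    ("analytics.account_health.arr_band", "dashboard:Customer Health"),
    ("raw.crm_accounts.owner_id", "analytics.accounts.owner_id"),
]

DASHBOARD_PREFIX = "dashboard:"  # assumed naming convention for dashboard nodes


def build_graph(edges):
    """Map each upstream node to the list of nodes that read from it."""
    graph = defaultdict(list)
    for upstream, downstream in edges:
        graph[upstream].append(downstream)
    return graph


def downstream_dashboards(graph, column):
    """Breadth-first walk from a column to every dashboard that depends on it."""
    seen = {column}
    queue = deque([column])
    dashboards = set()
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child in seen:
                continue
            seen.add(child)
            if child.startswith(DASHBOARD_PREFIX):
                dashboards.add(child[len(DASHBOARD_PREFIX):])
            else:
                queue.append(child)
    return sorted(dashboards)


graph = build_graph(EDGES)
print(downstream_dashboards(graph, "raw.crm_accounts.arr"))
```

In this toy graph, the raw `arr` column reaches two dashboards through two different models, while `owner_id` reaches none. Answering that kind of question by reading YAML files and pull requests by hand is exactly the work we wanted the tool to take over.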
Cultural shift
We did not want it to be just another modern solution that no one adopts. So, we required everyone on the team to use it, and it was an instant hit. When someone in our internal team Slack channels, be it an analyst or a data engineer, asked a question about data origin, business definition, and so on, we would not answer directly. Instead, we would send a link to the tool and let them discover whether it made the answer clear enough. If any information about the business logic was missing, we would update it.
We didn’t realize that, while we were working this way, we were actually building a solid business glossary. Once we became confident that the tool was working as expected and questions were being answered easily, we started exposing it to other stakeholders and sharing links to it in our Slack channels. This allowed us to repeat outside our team the practice we had started inside it. This again was an instant hit. Some PMs started updating definitions and business logic and even offered to collaborate on refining the business glossary. We also started embedding the Select Star links in our dashboards, bringing our ecosystem full circle.
Gradually, the repeated questions coming to our team declined, and the focus started shifting towards interpreting the data rather than extracting the correct data. Overall, confidence in the back-end data and confidence in our team got a significant boost.
What we learned
First, if you treat a catalog as a governance tool, it may not work as a company-wide solution. However, if you treat it as a productivity tool, it may work wonders for your team, as it did for mine.
Second, we were patient with the tool but aggressive about adoption. We made it mandatory for internal teams to use it, even though it meant a change of habit and caused some initial friction. Even so, we were able to incorporate it into our workflow easily and improve the quality of our lives. We were able to focus on the things that matter for modern data teams.
Third, you don’t need every stakeholder to adopt your proposal. You just need a few champions who will take the tool forward. Lastly, trust is a big factor when it comes to a data team. Going from a team that is under a lot of pressure to a team that is setting the direction of the product requires a lot of effort, and in our case, some diligence and the selection of the right tool helped us get there.
Final thought
Many data catalog tools exist, and AI has made them more capable. Modern catalog tools are lightweight, easy to use, and integrate seamlessly with your data ecosystem. Documentation is a big challenge for data teams, and not something most teams enjoy. Nor do stakeholders, who complain about a lack of documentation but are sometimes too lazy to read through the documentation that is provided.
So, a data catalog can make your life easier, automate finding answers, and help you build a business glossary. At the same time, it can help your data team mature from a descriptive, reactive team into a proactive one working on diagnostic, prescriptive, and predictive analysis that sets the direction for stakeholders.
Author: Snehal Karanjkar