Building a true enterprise data governance platform for the modern machine learning era

By Subash D’Souza, Director of Data Intelligence, Warner Bros

When Subash D’Souza, Director of Data Intelligence at Warner Bros, spoke to us at The Data Science Salon in New York City in September 2018, he claimed that data security is the unsexy side of data. However, with cybersecurity programs popping up at universities all over the world and online privacy issues prevalent in mainstream media, data security might be the most alluring field in existence. D’Souza was kind enough to take us through the most important components that make up an enterprise data governance strategy, bringing to light how important data security is in building and protecting our businesses.

Hey guys, I’m Subash D’Souza. You guys have been listening to AI, machine learning, and data science talks all morning. What I'm going to talk about is actually the unsexy side of data science, because 70 percent of the time data scientists and machine learning engineers spend is probably on the side of data governance. The reason for this is that you have to do a lot of data cleansing before you get the data into a form or a source that you can actually use for processing or put into a model.

For me personally, one of my key daily activities is understanding how data can be used in various forms within various organizations and enterprises. That means understanding what data means, from a very basic fundamental set, over a period of years. Modern enterprise data is never the same. Data is constantly evolving and ever-changing, which causes a lot of problems. Also, the same data sources can exist in multiple different forms, in multiple different sets, in multiple different organizations. We have to figure out how to handle these kinds of data changes.

Some of the attributes of enterprise data governance are data architecture management, data development, data operations management, data security, internal management, document and content management and metadata management. If you're a data engineer, your data scientists have to deal with all of this. Most of you have sat for hours trying to understand how you can get your data into a form that can actually be reused by future scientists and data consumers in general. I'm going to talk about each of these problems and identify how we actually accomplish this goal.

I look at enterprise data governance as the whole feature set of how you want to do things like metadata management, data quality management, data lineage and data security. Poor data quality can cost us both money and process efficiency. For example, if you're buying data from the same data source across multiple organizations, each of these organizations has its own internal way of purchasing its data sources. There's a possibility that we are buying these data sources from the same entity, meaning we would be paying for the same data sources multiple times when we should be looking at consolidating these licenses into one. When we want to calculate the financials across them, we want to make sure we can tie them together. If there is no parent ID that ties them together, we have to come up with systems to merge these datasets across multiple different platforms. As we progress, these kinds of data sources keep increasing over time and become a challenge when we want to bring them together. You have to guide information management decision-making to ensure information is consistently defined, well understood and trusted. This will improve the consistency of processes across the organization and ensure regulatory compliance to limit data risk.


The key thing to understand when bringing efficiency across the board is that the way you access the data can cause a lot of problems. We don't have one single way of accessing the data, so we access it through multiple channels using multiple different methods. Each of these methods brings its own challenges if you don't have a consistent pattern. From a compliance perspective, think about what is happening with GDPR, for example. GDPR is a European law that has been in effect since May 2018. If you're not in compliance, your organization can be fined up to 2 percent or 4 percent of its annual revenue, depending on the violation. Google was hit with a fine of about two and a half billion dollars by European regulators recently and continues to face more and more fines, just because of non-compliance. If you don't get rid of a consumer's information within the required period after they ask, you could be hit with a possible fine. Because of this, people are scrambling to make sure they're in compliance.

You also need to ensure that when people ask to be removed from the system, you can get rid of all their information. Sometimes this is very tricky because you've already aggregated the information into your system by scanning reports. You would need to reallocate and reprocess your data, and that is a huge challenge. How do you bring everything together while remaining in compliance, especially when you only have one access point to the data? Everyone is going to access the data from that single point, which could turn it into a choke point. Then, people might be blocked from accessing the data.

So far, we have implemented bits and pieces of this, but we really cannot operate at scale yet. Products today cover very specific niche aspects of data governance. Moreover, few people understand what master data management is. Sometimes you will have multiple sources of data with records that look the same, or potentially are the same, but do not match because of mistakes during ID entry. This presents a problem because now any system that uses the data cannot actually merge the two data points. Master data management is what makes sure that when these kinds of problems do occur, the records can still be merged together in the final output report.
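
As a rough illustration of that kind of merge, here is a minimal Python sketch. It is not the system described in the talk: the record layout, the sample names, the `master_id` format and the 0.85 similarity threshold are all assumptions made for the example, and it uses the standard library's `difflib.SequenceMatcher` as a simple stand-in for a real matching engine.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and drop punctuation and extra spaces so trivial differences vanish."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def is_same_entity(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two names as the same entity when their similarity meets the threshold."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def merge_records(records: list[dict], threshold: float = 0.85) -> list[dict]:
    """Group records whose names match and give each group one master ID."""
    masters: list[dict] = []
    for record in records:
        for master in masters:
            if is_same_entity(record["name"], master["name"], threshold):
                master["source_ids"].append(record["id"])
                break
        else:
            # No existing master matched, so this record starts a new group.
            masters.append({
                "master_id": f"M{len(masters) + 1}",  # hypothetical ID scheme
                "name": record["name"],
                "source_ids": [record["id"]],
            })
    return masters


# Hypothetical records from three systems; one contains a typing mistake.
records = [
    {"id": "crm-001", "name": "Acme Media, Inc."},
    {"id": "billing-17", "name": "ACME Media Inc"},
    {"id": "vendor-9", "name": "Acme Meida Inc"},
    {"id": "crm-002", "name": "Globex Studios"},
]
masters = merge_records(records)
for master in masters:
    print(master["master_id"], master["name"], master["source_ids"])
```

In practice the threshold would need tuning against reviewed matches, since a value that is too low merges distinct entities and one that is too high misses genuine duplicates.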

One study actually went through and talked to the CIOs of organizations to figure out what exactly the problem is. They said 52 percent of users don't have confidence in the information. They are concerned that when consumers see multiple reports with different results, they will be thrown off by the apparent disorganization. Consistent outputs are very difficult to achieve across multiple different consumer sites unless they all talk to the same people. It is a very common scenario, because if you don’t know whether your data exists, you definitely won’t know how to access it. 42 percent of managers use wrong information at least once a week. CIOs believe they can strengthen their competitive advantage by better using and managing enterprise data. 78 percent of CEOs want to improve the way users manage the data. Only 15 percent of senior executives believe that the data is currently comprehensively well-managed.

Overall, you want to be sure that the data you have is cleansed, ready to go and perfect from the get-go. That's very difficult because you can’t predict the actual way in which a consumer will process your data. The thing about data is that when data is found, it is actually created. There’s a lot of information that comes to light in real time. You always want to know what piece of information has come in from where, and you want to know if that's tied into your particular organization. For example, at Warner Bros, we have trailers out and we want to know how many people have consumed or watched them. We have many discussions about how we can use the marketing dollars we have to make sure that hidden needs are adequately managed to produce consistent quality. If your data is open source, you want to make sure that people know what data you have, where it is stored and how to get access to it. You can obviously say that only specific people can have access to the data or that only certain datasets can be accessed. At the same time, you want to make sure that people know what data is available even if they don't get actual access to the values or to what the data actually means.
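
One way to picture the split between knowing a dataset exists and being allowed to read it is a small catalog that answers searches with metadata only. The Python sketch below is an illustration under stated assumptions: the class names, fields, storage path and user names are hypothetical and do not describe any particular catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    description: str
    location: str  # where the data is stored
    owner: str     # who to ask for access
    allowed_users: set[str] = field(default_factory=set)


class DataCatalog:
    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> list[dict]:
        """Anyone can discover datasets; only metadata is returned, never values."""
        keyword = keyword.lower()
        return [
            {
                "name": e.name,
                "description": e.description,
                "location": e.location,
                "request_access_from": e.owner,
            }
            for e in self._entries.values()
            if keyword in e.name.lower() or keyword in e.description.lower()
        ]

    def can_read(self, user: str, dataset: str) -> bool:
        """Reading the values is limited to users on the entry's allow list."""
        entry = self._entries.get(dataset)
        return entry is not None and user in entry.allowed_users


catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="trailer_views_daily",
    description="Daily view counts per trailer",
    location="warehouse.marketing.trailer_views_daily",  # hypothetical path
    owner="marketing-data-team",
    allowed_users={"analyst_a"},
))

hits = catalog.search("trailer")
print(hits)
print(catalog.can_read("analyst_a", "trailer_views_daily"))
print(catalog.can_read("intern_b", "trailer_views_daily"))
```

Here a user without read access still learns that the dataset exists, where it lives and whom to ask, which is the behavior the paragraph above describes.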

The life cycle of data will only be as long as the data is consumed. Everything from the source to the final stage is part of the metadata lifecycle. Hence, you will see that there is metadata created and stored at every step of the transformation. You can also think about this from a security perspective. You do want to know who actually touches a particular aspect of the data at a particular point in time. You need to be able to do this in real time, because you cannot really come back later and say whether a user should have had access to the data or not. There's always a data lineage being created and you want to keep track of that. If you’re processing a bunch of datasets from point A to point D, you also want to know what exactly happened to the data at point C and what data sources were consumed. If you don't know what has happened at that step, you can't really be confident that the report you have is actually right. You want to be able to pull up the data lineage for any report and say confidently that it has the right data and that, overall, it was properly transformed. Then you will know that you can continue to use the data going forward.
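
To make the point A to point D example concrete, here is a minimal Python sketch of a lineage log. It is only an illustration: the step names, dataset names and user names are invented, and a real deployment would persist these events rather than keep them in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageEvent:
    step: str
    inputs: list[str]
    output: str
    run_by: str
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class LineageLog:
    def __init__(self) -> None:
        self.events: list[LineageEvent] = []

    def record(self, step: str, inputs: list[str], output: str, run_by: str) -> None:
        """Store who ran which step, on which inputs, and when."""
        self.events.append(LineageEvent(step, list(inputs), output, run_by))

    def trace(self, dataset: str) -> list[LineageEvent]:
        """Walk back from a dataset to the steps behind it, nearest step first."""
        by_output = {e.output: e for e in self.events}
        found, pending, seen = [], [dataset], set()
        while pending:
            current = pending.pop()
            event = by_output.get(current)
            if event is None or current in seen:
                continue  # a raw source, or a dataset already visited
            seen.add(current)
            found.append(event)
            pending.extend(event.inputs)
        return found


log = LineageLog()
log.record("clean", ["A"], "B", run_by="etl_service")
log.record("join", ["B", "vendor_ref"], "C", run_by="etl_service")
log.record("aggregate", ["C"], "D", run_by="report_builder")

steps = [(e.step, e.inputs, e.run_by) for e in log.trace("D")]
for step in steps:
    print(step)
```

Tracing the report at D shows that step C consumed both B and a second source, `vendor_ref`, which is the kind of detail you need before trusting the final numbers.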

No source, no matter how good the organization behind it is, will always be accurate in what it says. For example, we have noticed Google sometimes fails to send us the data on time. Sometimes it's only an hour delayed, but sometimes it can be as much as a day delayed. Do we have processes in place to ensure that we can still pull the data?
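
A simple process for this is a freshness check that compares the last arrival time of a feed with how often it is expected. The Python sketch below is an assumption-laden illustration: the hourly expectation, the 24-hour escalation threshold and the status labels are placeholders chosen for the example, not values from the talk.

```python
from datetime import datetime, timedelta, timezone


def check_freshness(
    last_arrival: datetime,
    now: datetime,
    expected_every: timedelta = timedelta(hours=1),   # assumed delivery cadence
    escalate_after: timedelta = timedelta(hours=24),  # assumed escalation point
) -> str:
    """Classify a feed as on time, late, or late enough to escalate."""
    delay = now - last_arrival - expected_every
    if delay <= timedelta(0):
        return "on_time"
    if delay < escalate_after:
        return "late"      # e.g. trigger a retry of the pull
    return "escalate"      # e.g. alert a person and backfill later


now = datetime(2018, 9, 27, 12, 0, tzinfo=timezone.utc)
print(check_freshness(datetime(2018, 9, 27, 11, 30, tzinfo=timezone.utc), now))
print(check_freshness(datetime(2018, 9, 27, 10, 0, tzinfo=timezone.utc), now))
print(check_freshness(datetime(2018, 9, 26, 10, 0, tzinfo=timezone.utc), now))
```

Run on a schedule, a check like this separates the hour-late case, which a retry usually fixes, from the day-late case, which typically needs a person and a backfill.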

All of these data engineering pipeline processes need to be handled. Again, these are not the most glamorous features of data engineering, but they need to be taken care of. At the end of the day, when you give these reports out, you want to make sure the consumers are highly confident in the values you give them. For example, you can set up a bunch of routines and rules to ensure that a customer’s personal data has been removed. You run data validation routines and then you finally do integrations at the end.
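
The routines mentioned above could look something like the following Python sketch, which removes personal fields from a row and then validates it before integration. Everything specific here is an assumption for illustration: the list of personal fields, the e-mail pattern, the `trailer_id` and `views` columns and the validation rules.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PII_FIELDS = {"name", "email", "phone"}  # assumed list of personal fields


def strip_pii(row: dict) -> dict:
    """Drop known personal fields and mask e-mail addresses left in free text."""
    cleaned = {k: v for k, v in row.items() if k not in PII_FIELDS}
    for key, value in cleaned.items():
        if isinstance(value, str):
            cleaned[key] = EMAIL.sub("[redacted]", value)
    return cleaned


def validate(row: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the row passes."""
    errors = []
    if not row.get("trailer_id"):
        errors.append("missing trailer_id")
    views = row.get("views")
    if not isinstance(views, int) or views < 0:
        errors.append("views must be a non-negative integer")
    leaked = PII_FIELDS & set(row)
    if leaked:
        errors.append(f"personal fields present: {sorted(leaked)}")
    return errors


raw = {
    "trailer_id": "t-42",
    "views": 1200,
    "email": "jane.doe@example.com",
    "note": "contact jane.doe@example.com for rights",
}
clean = strip_pii(raw)
print(validate(raw))
print(clean)
print(validate(clean))
```

Only rows that come back with an empty error list would move on to the integration step.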

I believe very strongly that you need to have a data catalog, and this is something that you want to expose to all of your users. After looking at the raw data, it is up to your ETL developers and specialists to actually put the data into places where your data scientists can work. Finally, you have your report developers and business analysts who will actually use these information sets, but they all come in at different points. At each stage, you need to have a data catalog available so you can make informed decisions. When you work with multiple different consumers of the data, each of them has a different viewpoint on how the data should be used and how it should be consumed. Each of them needs to be aware of what the data sources are. It could be that they are using Google Analytics or third-party data aggregators like Acxiom.

If we look at data security, compliance, data quality and data lineage, each of these adds value over a period of time. All of these points make up the enterprise data governance strategy in any organization, as well as its business value. For example, in terms of revenue, if you consolidate your data sources into one, you're not paying into multiple channels, and you can also increase business value because people know where the data is coming from. Also, if your data scientists know the entire breadth of the datasets, they can start building new use cases. Typically the way it works is that one of your end users suggests a specific case they want to build against, and your data scientist can actually come back with ten different potential use cases because they have a comprehensive view of all the datasets out there. Then the data scientist and end consumer can work together to compile and refine the cases. This back and forth between the end user and the data scientist wouldn’t be able to happen without a comprehensive view of what datasets are available. It only happens because of a united enterprise data governance strategy that promotes agility.

There’s not a product out there today that solves every problem set across the board. Niche consumers’ needs are very expensive to meet, and we can’t build solutions for them that we know will work without testing them out. However, we keep training on our data and growing our datasets to ensure that the models get more and more accurate. We continue to have people manually sitting down and looking through these different data points to identify errors that might otherwise be overlooked. Over time, our model will learn how to merge duplicates and correct its own errors. It's not a marathon - it's actually baby steps. You have to start small. Today we may be at the walking stage, but we know what the paragliding stage is and where we want to be. It takes a lot of effort in terms of the policies that need to be set up. And right now, not all data consumers are following the same data strategy.

What I feel is that it will take time for any organization to get through the process, and that's why it is so important to work with your team in order to build an enterprise data governance strategy that works for the company at large.

