Cybersecurity and AI are both complex enough on their own, but what are the possibilities and challenges when these two fields collide?
With data growing continuously, a shortage of talent, and increasing threats from digital attackers, the need for AI and machine learning in the cybersecurity space is greater than ever.
We had the opportunity to speak with an expert in the field at a previous DSS Break live show – Nirmal Budhathoki, Senior Data Scientist at VMware Carbon Black – and asked him for his opinions and predictions on the impact of AI in cybersecurity.
Nirmal: When I started as a data analyst, that's when I started seeing the value of easing the burden on resources at the cyber command, like cybersecurity analysts. The threat hunters are tired of dealing with the exponential growth in alerts that comes from so many devices being digitized. The network is always getting bigger instead of smaller, and a lot of alerts come with it, so it has become overwhelming. The only option is automation: having an AI or machine learning system learn from the data, learn from the patterns, and then try to help them out. That's the only hope. The uniqueness comes from the fact that every network is different. All these companies have their own setup of VPNs or whatever network they have established. They are also in different clouds now, often several of them because of this cloud-agnostic mentality, which means you're opening up more threat space. The lack of talent is obviously another point. People who have threat hunting or cybersecurity experience and people who have data science experience need to be brought together, and we need to figure out a way to combine the two. That intersection is why I find the security and data science combination so unique.
Nirmal: Like I said before, the data is growing at an exponential rate. This includes video data and every post that people publish on social media; the growth rate is so much higher than before. Also, because of Covid-19, a lot of companies decided to go remote, and that's generating even more data. Obviously the risk factor increases with that. IoT devices are another factor. People are into crypto and blockchain as well, which hugely increases the data surface, and the attack vectors are increasing daily. Like I said, the SOC analysts or security analysts are not enough by themselves to deal with this. And the cyber war now is not going to be a manual war, it's going to be a digital war. When the adversaries have advanced-level tools, like AI tools, to attack or hack the systems and we only put humans on defense, we have already lost. AI or automation is going to be the rescue, or should I say the lightsaber from Star Wars. AI is what we can give to our knights to fight.
Nirmal: It's challenging for data scientists especially, because most of the models we learn from are in other domains, where there are plenty of data sources available on the public internet. But for companies, network data is sensitive. You don't want to share that data. So you don't see many cybersecurity-related competitions compared to other domains, and you don't see free, publicly available data sources that are real instead of simulated. That handcuffs data scientists, because they can only learn from limited resources or data sets. When they land a job they can obviously learn on the job, but when you are ramping up your skill set, it's hard. Most of the problems we deal with also call for unsupervised learning, because it's hard to find labels, especially in cyber data. Not to mention the class imbalance, which exists in many other problems but is especially severe in anomaly detection. Almost any classification problem in cyber becomes a huge class imbalance problem, because there are only a few bad things going on and the good data is huge compared to that. The other challenge, as I usually put it, is that the security team has to get it right every single time, while the offense, whoever is attacking, only has to get it right one time. If they get it right once, they're in the network. It's hard in itself.
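To make the class imbalance point concrete, here is a minimal sketch in Python. The event counts (9,990 benign, 10 malicious) and the function names are hypothetical and chosen only for illustration; they are not from the interview. It shows why accuracy alone is misleading on such data, and it computes inverse-frequency class weights, one common heuristic for rebalancing, which is an assumption here and not a method Nirmal describes.

```python
from collections import Counter

def accuracy_and_recall(y_true, y_pred, positive=1):
    """Return (accuracy, recall for the positive class)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == positive and p == positive
                   for t, p in zip(y_true, y_pred))
    positives = sum(t == positive for t in y_true)
    recall = true_pos / positives if positives else 0.0
    return correct / len(y_true), recall

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical data: 0 = benign event, 1 = malicious event.
labels = [0] * 9990 + [1] * 10

# A useless "model" that calls everything benign.
always_benign = [0] * len(labels)

acc, rec = accuracy_and_recall(labels, always_benign)
weights = inverse_frequency_weights(labels)

print(f"accuracy={acc:.3f} recall={rec:.1f}")
print(f"weight for malicious class: {weights[1]:.1f}")
```

The always-benign model scores 99.9% accuracy while catching none of the malicious events, which is why recall and precision on the rare class are usually the more informative metrics for this kind of problem.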
Nirmal: Some of the use cases I can think of would be the anomaly detection that I mentioned earlier. You want to see what the deviation is from the normal baseline. For example, someone is exfiltrating data. As a regular user my average data download is, say, 50 MB per day, and all of a sudden I see a spike of a thousand gigs. Something is definitely going on. Have my credentials been hacked? That's a deviation from normal behavior, so it's an anomaly detection problem. The other one is alert fatigue for the SOC analysts. We can prioritize a list for them and remove some of the false positives. There are also binary classification problems, like detecting phishing emails or separating malware from benign files. We can also provide recommendations for security tools: if we see a similar network, we can recommend good security tools to a company that is doing something wrong or has a bad configuration. Another area that is on the rise, which is an old concept but is becoming hot right now, is user and entity behavior analytics, also called UEBA. That's monitoring the users' behavior, so it is deviation from the baseline just like anomaly detection, but UEBA combines everything in one place, because my behavior can be a combination of many things and many factors. Those are the top use cases or problem spaces.
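The download-spike example can be sketched as a simple baseline-deviation check. This is an illustration under stated assumptions, not the approach used at any particular company: the daily history values are made up around the 50 MB figure from the interview, the function names are hypothetical, and the scoring uses the median and median absolute deviation (MAD) with a cutoff of 3.5, which is a common convention and not something Nirmal specifies.

```python
from statistics import median

def robust_scores(history_mb, observed_mb):
    """Score observations by their distance from the historical median,
    scaled by the median absolute deviation (MAD)."""
    baseline = median(history_mb)
    deviations = [abs(x - baseline) for x in history_mb]
    mad = median(deviations)
    # 1.4826 makes MAD comparable to a standard deviation for normal data;
    # fall back to 1.0 if the history has no spread at all.
    scale = 1.4826 * mad if mad > 0 else 1.0
    return [(x - baseline) / scale for x in observed_mb]

def flag_anomalies(history_mb, observed_mb, threshold=3.5):
    """Return True for each observation whose robust score exceeds the threshold."""
    return [abs(s) > threshold for s in robust_scores(history_mb, observed_mb)]

# Hypothetical daily download volumes (MB) for one user.
history = [48, 52, 50, 47, 55, 49, 51]

# A normal day, then a spike of roughly a thousand gigabytes.
today = [53, 1_000_000]

print(flag_anomalies(history, today))  # [False, True]
```

The median and MAD are used here instead of the mean and standard deviation because a single past spike would otherwise inflate the baseline and hide later ones. A UEBA-style system would combine many such signals per user instead of relying on one.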
Nirmal: Being in the cyber industry for quite some time, I've seen a shift, at least compared to before. One reason could be that companies don't have a choice now. Covid-19 has definitely changed the picture, because people are working remotely with their laptop in hand instead of going to the office and connecting to the company network. That increases the threat space in itself. So companies have to consider these pandemic and remote-work factors when calculating their cyber risk. I usually give this analogy: with an investment in security it is very hard to see the ROI until something happens, like millions lost in a data breach. Until that point you did not know how valuable your data was. It's like insurance. We buy it thinking we may end up in an accident, and if we don't, it feels like we're just spending money. But the flip side is that if it does happen, you end up paying far more than the insurance would have cost. Data is an asset and companies have realized they need to protect it. The trend of attacks is increasing. Ransomware is the new trend: attackers encrypt the data and ask for money. A lot of incidents don't even come out in the news. The big ones do, like SolarWinds and the recent Colonial Pipeline attack. The most recent one is Log4j, which everyone is trying to fix. I would say there's a new paradigm, where companies are thinking from a cybersecurity perspective now, which is good. I think it will stay like that. We have no choice now.
Are you interested in joining the AI conversation with industry leaders and senior practitioners? Check out the Data Science Salon events 2023, early bird rates are now available.