Experimental Unsupervised Log Anomaly Detection

Unsupervised log anomaly detection is a technique for detecting anomalies in logs. It can be more effective than supervised log anomaly detection, which requires a lot of labeled data.

The idea is to be able to do this from the command line using a pip package I created called SWTK (SoftWater Toolkit).

Unsupervised Log Anomaly Detection

After reading this article, you’ll:

Learn about a new log anomaly detection approach
Understand why the current approaches may not be the best
Understand a cybersecurity-specific use case for this technique
Learn about SWTK
Understand how the tool can be used
Learn about the potential drawbacks of the toolkit
Learn about potential future research opportunities

Unsupervised Log Anomaly Detection has the potential to become a very powerful technique in cybersecurity as well as in DevOps in general. It is also used in other domains where logs are generated, like IoT. Forward-thinking engineers are already using it to detect anomalies in logs generated by the sensors placed on their equipment.

The idea is that we can learn what a normal log looks like from the same set of logs that contains anomalous logs. I mean, that’s what I do when I want to understand why something happened. So why not try asking an AI to do the same?

We can have the AI start learning what a log for the system is supposed to look like from the very first line and keep learning as it reads. We can then ask the model to tell us where the weirdest lines in the log appear.
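To make the "learn along the way" idea concrete, here is a minimal Python sketch. It is not SWTK's algorithm. It is a simple stand-in that scores each line by how surprising its whitespace-separated tokens are, given only the lines read before it. The function names, the tokenization, and the add-one smoothing are all my assumptions for illustration.

```python
import heapq
import math
from collections import Counter


def score_lines_online(lines):
    """Score each line against the lines seen before it, then learn from it."""
    token_counts = Counter()  # how often each token has appeared so far
    total_tokens = 0
    scores = []
    for line in lines:
        tokens = line.split()  # assumption: whitespace tokenization
        if not tokens:
            scores.append(0.0)
            continue
        # Average surprise (negative log probability) of the tokens, with
        # add-one smoothing so unseen tokens get a large but finite score.
        vocab = len(token_counts) + 1
        surprise = sum(
            -math.log((token_counts[t] + 1) / (total_tokens + vocab))
            for t in tokens
        )
        scores.append(surprise / len(tokens))
        # Learn from the line only after it has been scored.
        token_counts.update(tokens)
        total_tokens += len(tokens)
    return scores


def weirdest_lines(lines, n=5):
    """Return (line_number, score, line) for the n highest-scoring lines."""
    scores = score_lines_online(lines)
    ranked = heapq.nlargest(n, enumerate(scores), key=lambda pair: pair[1])
    return [(i + 1, s, lines[i]) for i, s in ranked]
```

Because the model has seen nothing before the first line, early lines are scored against very little evidence. That is the trade-off of learning and scoring in a single pass.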

Problem with Supervised Log Anomaly Detection

Generally speaking, Log Anomaly Detection is treated as a supervised learning problem. This makes sense: logs are something you should have a lot of, especially if you are a large company or a government agency, so you should have plenty of data to train your model on.

The issue is that people misunderstand the term supervised learning. Supervised learning is not just about having a lot of data. It is about having a lot of data that is labeled. Ideally, you would have an MLOps team working within the DevOps team to automatically label the logs once the analysts have finished their manual analysis. This allows you to build a large dataset with the labels that you care about. From there, the team can decide which algorithms to use to create the different models, and those models can then be used to detect anomalies in the logs. This is the ideal scenario.

But as far as the job pool is concerned, an ML engineer is a mythical creature. All the best ML engineers are either working on their own projects or are getting paid absurdly well on projects that are far more important than dealing with logs.

In the case of Log Anomaly Detection, you would need a lot of logs that are labeled as normal and a lot of logs that are labeled as anomalous. This is not always possible, especially if you are a small company or a startup. You might have a lot of logs, but you might not have many that are labeled as anomalous.

Incident Response use case for Unsupervised Log Anomaly Detection

Part of my duties in SecOps involves Incident Response (IR). Basically, if a company gets hacked, a capable CISO calls people in for IR. Trust me when I say that this is the worst situation in which to learn about someone's digital environment. The person responsible, usually the CTO, is simply having the worst day of their life. Often my priority is to make sure the person I am working with does not break down while sharing information with me. In some cases, the best I can get is a 2 million line dump of their web server's logs. Don't get me wrong, if I had the time to set up a SIEM using the ELK stack or a cloud SIEM through your favorite CSP (Cloud Service Provider), I would be able to instantly identify the lines that led to the compromise without using AI. However, time is not an abundant resource during IR.

This is where the allure of Unsupervised Learning comes into play. If you want to build an AI, you have to choose how it is going to learn: from a dataset that is labeled, or from a dataset that is not. In the case of Log Anomaly Detection, you might not have a lot of labeled data, and this is where unsupervised learning comes in handy. You can use it to create a model that detects anomalies in logs as it sees them. It learns what the log is generally supposed to look like and, by extension, what it is not supposed to look like. This is the best you can do in the case of IR. Remember, this is a tool. It is not meant to replace an IR team. It is meant to help one.

So how do I wield this magic?

I created a pip package named SWTK (SoftWater Toolkit) for unsupervised log anomaly detection from the command line. You can install it with pip install SWTK and then run it with SWTK -i sample.txt. The sample.txt file is a sample log file that I have included in the GitHub repo. The code is written partly in Python and partly in Rust, and it uses a Trie data structure to create the model. A Trie is a tree-shaped data structure for storing strings in which strings that share a prefix share the same path from the root. That makes it very efficient both at storing strings and at searching for them. The Trie algorithm is coded in Rust while the wrapper is coded in Python. This allows for maximum ease of access as well as maximum computational speed.
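For readers who have not met a Trie before, here is a small Python sketch of the data structure itself. SWTK's actual Trie is written in Rust and I am not reproducing it here. The class names, the per-node counter, and the method names are my own choices for illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # maps a character to the next node
        self.count = 0       # how many inserted strings pass through this node
        self.is_end = False  # True if an inserted string ends exactly here


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, text):
        """Add a string, creating nodes only where the path does not exist yet."""
        node = self.root
        for ch in text:
            node = node.children.setdefault(ch, TrieNode())
            node.count += 1
        node.is_end = True

    def contains(self, text):
        """Return True if this exact string was inserted."""
        node = self._walk(text)
        return node is not None and node.is_end

    def prefix_count(self, prefix):
        """Return how many inserted strings start with this non-empty prefix."""
        node = self._walk(prefix)
        return node.count if node is not None else 0

    def _walk(self, text):
        """Follow the characters of text from the root; None if the path ends."""
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node


trie = Trie()
for line in ["GET /index.html 200", "GET /about.html 200", "POST /login 302"]:
    trie.insert(line)

print(trie.prefix_count("GET /"))   # lines that start with "GET /"
print(trie.contains("POST /login 302"))
```

Lookups cost time proportional to the length of the string being searched, not to the number of strings stored, which is why the structure suits logs with many near-identical lines.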

How to use SWTK

sudo pip install SWTK

SWTK -h

Run this to check that the pip package was installed correctly.

This is the file I will be running the test on. It is a synthetic Web server log.

SWTK -i input.txt -n 5

This command returns the 5 most anomalous lines from this text file.

You may notice some extra setup the first time you run this. Try running a different command afterwards and the extra setup should not appear again.

SWTK -i input.txt -n 5 -v yes

This command will also output the anomaly score the program assigned to each line.

Now let's say you want to save the model so you can use it again later.

SWTK -i input.txt -s save.model

Now you want to use this model on a different file and find the weirdest 5 lines. With SWTK, that’s easy.

SWTK -i sample.txt -m save.model -n 5 -v yes

If you want to update the model instead of just using it, this is how you do it with SWTK:

SWTK -i sample.txt -m save.model -s save.model
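If you would rather drive these commands from a Python script, for example to loop over several log dumps during IR, a thin wrapper around the command line might look like the sketch below. It only uses the flags shown in this article (-i, -n, -v, -m, -s). The helper names build_swtk_command and run_swtk are hypothetical and not part of SWTK, and since the output format is not described here, the sketch simply returns whatever the tool prints.

```python
import subprocess


def build_swtk_command(input_path, top_n=None, show_scores=False,
                       load_model=None, save_model=None):
    """Build the SWTK argument list from the flags used in this article."""
    cmd = ["SWTK", "-i", input_path]
    if load_model is not None:
        cmd += ["-m", load_model]   # reuse a previously saved model
    if top_n is not None:
        cmd += ["-n", str(top_n)]   # number of anomalous lines to return
    if show_scores:
        cmd += ["-v", "yes"]        # also print the anomaly scores
    if save_model is not None:
        cmd += ["-s", save_model]   # write the (updated) model to disk
    return cmd


def run_swtk(input_path, **options):
    """Run SWTK and return its raw standard output as text."""
    cmd = build_swtk_command(input_path, **options)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Same as: SWTK -i sample.txt -m save.model -n 5 -v yes
print(build_swtk_command("sample.txt", top_n=5, show_scores=True,
                         load_model="save.model"))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when a log file name contains spaces.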

Disclaimer

This is an experimental project that is not yet ready for commercial or mission-critical applications. I am a Cyber and MLOps professional, and as such, I do not write production code.

As you'll see if you look at the code base at github.com/nileshkhetrapal/SWTK, I have made documentation and readability a priority. I have also made the code as modular as possible so that it can be used in other projects.

Future research

The first thing to do is to test this algorithm against a popular dataset like MNIST or IDS. We will then be able to better understand the efficacy of this tool.
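One way such a test could be scored, assuming the dataset comes with a true/false anomaly label for every line, is precision at k: of the k lines the tool ranks as weirdest, what fraction are really anomalous? The sketch below is my own suggestion for a metric, not something SWTK currently does, and the function name is hypothetical.

```python
def precision_at_k(scores, labels, k):
    """Fraction of the k highest-scoring lines that are labeled anomalous.

    scores: one anomaly score per line (higher means weirder).
    labels: one boolean per line, True if the line is a known anomaly.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    # Indices of the lines sorted from most to least anomalous.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for i in top if labels[i]) / len(top)
```

This matches how the tool is used in practice, where an analyst only has time to read the top few lines, so what matters is how many of those are worth reading.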

I want to study the extent to which AI, in its current state, can be used in the field of cybersecurity. I want to build an AI malware analyst. The issue is that such a project would require sponsorship from a CSP that provides services like Virtual Machines and Machine Learning. It would be interesting to see how well an AI malware analyst compares to a real malware developer in a competition like this one.