Supercharge Your AI Agents: The Power of Evaluations

By Aditya Palnitkar

As AI agents become increasingly prevalent in software applications, it's essential to ensure they're performing optimally. But how do you measure their success? 

The answer lies in evaluations (evals). In this post, we'll explore the importance of evals for Large Language Model (LLM) agent applications and provide a step-by-step guide on how to build an effective eval system.

Why Evals Matter

Evals and metrics that measure your AI agent's performance are crucial for LLM agent applications because they help you:

  • Catch regressions early: Every changeset you merge has the potential to improve your agent or to cause a regression. Evals tell you which of the two you're getting.
  • Monitor production systems: Say user satisfaction in your product dropped last week. Evals can tell you whether the AI agent caused it, and which changeset was responsible.
  • Make quantifiable trade-offs: A newer, larger LLM can often drastically improve your AI agent's performance, but comes with increased latency or GPU costs. A good eval system and metric help you make this trade-off in a rigorous manner.
  • Find the weakest spots in your agent flow: Evals can tell you whether you need to invest in better RAG, integrate more tools, or fix hallucinations, and so help you craft your tech roadmap.
  • Boost confidence in the metric and the team: You want a metric that correlates with user experience and shows consistent improvement over a long time horizon, reflecting the effort your team has put in. Once you hit this sweet spot, your leaders will start trusting your metric, and your dev team along with it.

There are two steps to building your evaluation system: selecting the right metric, and choosing the right dataset. Let's walk through each of these steps.

Build your metric

The first step to a great eval system is a comprehensive metric.

The reason we need to think carefully about the metric is that, unlike in many other ML domains, the ground truth for LLM AI agents is extremely fuzzy and complex, much like human interactions in general.

Instead of simply assigning binary "good" or "bad" labels to your AI agent's conversations, you need to consider multiple dimensions for your metric, which then flow into a final label. Here is a very simplified example of three such dimensions:

You can probably think of tens of other dimensions I haven't included in the flowchart: toxicity, tone, biased responses, and so on. The dimensions you choose for evaluating your AI agent are specific to your use case: for instance, biased responses would be extremely problematic for an AI realtor, while hallucinations would be particularly bad for a Medical AI Assistant.

If you have a data source that directly captures user sentiment, like a survey, you can find the dimensions that correlate most strongly with your users' sentiment and use those to create your custom metric.
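As a rough illustration, here is a minimal Python sketch of that correlation step. The dimension scores, the survey column, and the correlation-based weighting are all hypothetical stand-ins for whatever data and scheme you actually use.

```python
import pandas as pd

# Hypothetical per-conversation scores: three metric dimensions (0-1) plus a
# user sentiment score collected from a survey. Column names are illustrative.
df = pd.DataFrame({
    "coherence":      [0.9, 0.4, 0.7, 0.8, 0.3],
    "hallucination":  [0.1, 0.6, 0.2, 0.1, 0.7],  # higher = more hallucination
    "task_success":   [1.0, 0.0, 1.0, 1.0, 0.0],
    "user_sentiment": [0.95, 0.2, 0.8, 0.9, 0.15],
})

dims = df.drop(columns="user_sentiment")

# How strongly does each dimension track surveyed sentiment?
correlations = dims.corrwith(df["user_sentiment"])
print(correlations.sort_values(ascending=False))

# Fold the dimensions into one custom metric, weighted by signed correlation
# so that negatively correlated dimensions (like hallucination) pull it down.
weights = correlations / correlations.abs().sum()
df["custom_metric"] = (dims * weights).sum(axis=1)
```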

Build your dataset

We'll simplify the strategy for building your eval system by discussing its three dimensions. You can pick and choose different points along each dimension, leading to evals that work for multiple situations and use cases.

We'll walk through these dimensions using a hypothetical example of a Medical Assistant AI Agent: a chatbot that asks a user some questions about their symptoms and follows up with a diagnosis and treatment plan.



Level of Scaling

Unlike in traditional ML, there is no readily available ground truth, so label generation can be expensive. Across the techniques used for label generation, higher reliability comes with higher cost. However, you can scale up your labelling without sacrificing quality, as explained below.

Expert Labelling: 

Subject matter experts provide the highest reliability and are the most expensive. For our Medical Assistant, these would be doctors evaluating conversations with patients and deciding whether they were positive or negative samples.

However, you don't want to use your experts just to hand out binary "good" or "bad" decisions. Given how expensive they are, it makes the most sense to use them to create "guidelines": very objective flowcharts that allow anyone to determine whether an interaction was good or bad along a metric dimension. Here's an example breaking down a "Coherence" metric into a more granular decision chart. Remember, you can almost never make your guideline too granular:

You should also use expert labelers to provide "key points": these are annotations for a set of ~100 conversations with your AI agent, listing the points that an AI agent response should or should not contain.

  • For instance, a “key point” that should be present in the Medical Assistant’s response to a user question asking for fever medication would be to ask for their temperature.
  • Having key points will allow you to automate evals using LLM judges, as we will explore ahead; a sketch of one such annotation follows this list.
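A key-point annotation can be as simple as a structured record per user request. The schema below is a hypothetical sketch, not a standard format:

```python
from dataclasses import dataclass, field

@dataclass
class KeyPointAnnotation:
    """Expert-written expectations for one user request in the eval set."""
    user_query: str
    must_include: list[str] = field(default_factory=list)      # points the response should cover
    must_not_include: list[str] = field(default_factory=list)  # points the response must avoid

# Example for the Medical Assistant scenario described above.
fever_example = KeyPointAnnotation(
    user_query="What medicine should I take for my fever?",
    must_include=[
        "Ask the user for their current temperature",
        "Ask about other symptoms before suggesting medication",
    ],
    must_not_include=[
        "A definitive prescription without any follow-up questions",
    ],
)
```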

Scaled Labelling:

Scaled labelers, while not experts themselves, are human labelers experienced in evaluating LLM conversations. They take the detailed guidelines generated by experts and scale the labelling process to hundreds of conversations a day. They offer less reliability but higher scalability than expert labelers, and are hence an essential component of an eval system.

LLM as a judge:

We are now in the deep end of scalability: an LLM judge is so scalable that it can be run multiple times on every single changeset to find regressions. It might seem strange to use LLMs to evaluate AI agents, and it is natural to be sceptical, given how prone to biases LLM judges can be. However, there are fixes for LLM judge unreliability that let us have our cake and eat it too: an eval that is reliable as well as scalable.

  • We can use the "key points" collected from experts as a reference for the LLM judge to compare AI agent responses against.
    • Instead of asking the LLM judge to evaluate a response in the abstract, we ask it to check whether the response covers all of our key points and contradicts none of them (a prompt sketch follows below).
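Here is a minimal sketch of such a key-point-checking judge. The JUDGE_PROMPT wording, the judge_response helper, and the call_llm client are assumptions standing in for your own prompt and LLM client; it reuses the KeyPointAnnotation record sketched earlier.

```python
import json

JUDGE_PROMPT = """You are grading an AI medical assistant's response.

Key points the response SHOULD cover:
{must_include}

Key points the response should NOT contain:
{must_not_include}

Assistant response:
{response}

Return JSON: {{"covered": [...], "missing": [...], "violations": [...], "pass": true or false}}
"""

def judge_response(response: str, annotation, call_llm) -> dict:
    """Ask the judge LLM to check a response against expert key points.

    `call_llm` is a stand-in for whatever LLM client you use; it should take a
    prompt string and return the model's text output.
    """
    prompt = JUDGE_PROMPT.format(
        must_include="\n".join(f"- {p}" for p in annotation.must_include),
        must_not_include="\n".join(f"- {p}" for p in annotation.must_not_include),
        response=response,
    )
    return json.loads(call_llm(prompt))
```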

You should always try to move your eval system towards the highly scalable LLM judge stage, while using expert and scaled labelers to train and test your LLM judge. The end goal should be an LLM judge that is highly scalable yet at least as reliable as an average scaled labeler.
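One simple way to test the judge is to run it over a held-out set of conversations your human labelers have already graded and measure how often it agrees with them (inter-rater statistics such as Cohen's kappa work too). A minimal sketch, assuming you already have paired pass/fail labels:

```python
def judge_agreement(human_labels: list[bool], judge_labels: list[bool]) -> float:
    """Fraction of conversations where the LLM judge matches the human verdict.

    Both lists hold pass/fail labels for the same held-out conversations; the
    human labels come from your expert or scaled labelers.
    """
    assert len(human_labels) == len(judge_labels)
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

# Hypothetical example: the judge agrees with humans on 8 of 10 conversations.
print(judge_agreement(
    [True, True, False, True, False, True, True, False, True, True],
    [True, False, False, True, False, True, True, True, True, True],
))  # 0.8
```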

Type of dataset

Another dimension to consider when building an eval system is the type of dataset used. There are two main options:

Fixed Dataset: 

A fixed dataset contains a list of a few hundred or so user requests. While the user requests are fixed, your AI agent's responses will change over time with updates to the code and surrounding tools. We can then evaluate these changing responses to ensure that our metrics are indeed improving. A fixed dataset is helpful for plotting your metrics over a long time horizon while protecting against data drift in production usage. A major advantage of a fixed dataset is that it allows us to use "key points", and consequently LLM judges, for scalable evaluation.
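In practice, a fixed-dataset eval can be a simple loop: replay the stored requests through the current agent build and score each fresh response with the judge. A minimal sketch, where run_agent and judge are hypothetical stand-ins for your agent entry point and the LLM-judge helper sketched earlier:

```python
def evaluate_fixed_dataset(dataset, run_agent, judge) -> float:
    """Replay fixed user requests through the current agent build and score them.

    `dataset` is a list of key-point annotations (see the earlier sketch),
    `run_agent` returns the agent's current response for a stored request, and
    `judge` wraps the LLM-judge call and returns a verdict dict with a "pass" key.
    All three names are assumptions for this sketch, not a fixed API.
    """
    verdicts = []
    for annotation in dataset:
        response = run_agent(annotation.user_query)  # responses change as your code changes
        verdicts.append(judge(response, annotation))
    return sum(v["pass"] for v in verdicts) / len(verdicts)  # plot this pass rate over time
```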

Live dataset:

A live dataset samples conversations from real-world production usage. This type of dataset is useful for evaluating your AI agent's performance on the current usage distribution.

Granularity

The final dimension to consider when building an eval system is granularity. This refers to the level of detail at which the eval system evaluates the LLM's performance. There are three main options:

Component-Level: 

  • A typical AI agent can consist of dozens of nodes. A very simple example below contains nodes for RAG, tool usage, integrity, summarization, and so on.
  • Logging the input and output of each stage allows you to evaluate each component individually and find weak spots in performance.
  • Component-level evaluation can also yield far more objective metrics, e.g. precision/recall for a RAG component, which is a welcome change from the fuzzy world of LLM evals.

For instance, in the example above, we can create a dataset for our RAG stage that contains queries along with the articles that should have been retrieved for each. Once we have this dataset, it is easy to compute precision/recall against this ground truth. A common technique for creating a RAG eval dataset is to use an LLM to generate synthetic queries from specific articles, producing (query, article) pairs for your dataset.
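Here is a minimal sketch of that computation, assuming each eval item stores the article IDs your retriever returned and the article IDs marked as relevant (by experts or the synthetic-query generator); the IDs and helper name are illustrative:

```python
def rag_precision_recall(retrieved_ids: set[str], relevant_ids: set[str]) -> tuple[float, float]:
    """Precision and recall for a single query against ground-truth articles."""
    true_positives = len(retrieved_ids & relevant_ids)
    precision = true_positives / len(retrieved_ids) if retrieved_ids else 0.0
    recall = true_positives / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Example with hypothetical article IDs for one query in the fixed RAG dataset.
retrieved = {"art_12", "art_48", "art_97"}
relevant = {"art_12", "art_55"}
print(rag_precision_recall(retrieved, relevant))  # (0.33..., 0.5)
```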

Single-turn:

A single-turn dataset contains a user query and a single AI agent response, which makes it the most straightforward to create. It is great for metrics like toxicity that don't need to take the entire conversation into account.

Conversation-level:

Conversation-level datasets involve evaluating entire conversations between the user and the LLM. This is the level most aligned with user sentiment, because some metrics, like "consistency", can only be measured across a whole conversation. It is also essential for ensuring that your AI agent uses memory effectively and doesn't hallucinate due to irrelevant memory or forget relevant pieces of information. While a live conversation dataset is straightforward (simply sample conversations from live traffic), a fixed conversation-level dataset poses a challenge: the user queries can no longer be static, because they need to account for the AI agent's changing responses in previous turns. A common trick is to simulate the user with an LLM user model, so that the simulated queries adapt to the agent's responses.
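A minimal sketch of such a user simulator is below. The USER_SIM_PROMPT wording, run_agent, and call_llm are hypothetical stand-ins for your own prompt, agent entry point, and LLM client:

```python
USER_SIM_PROMPT = """You are role-playing a patient talking to a medical assistant chatbot.
Your persona and goal: {persona}

Conversation so far:
{history}

Reply with the patient's next message only."""

def simulate_conversation(persona: str, run_agent, call_llm, max_turns: int = 5) -> list[dict]:
    """Generate one multi-turn conversation for a fixed eval scenario.

    `run_agent` returns the agent's reply given the conversation so far, and
    `call_llm` is whatever LLM client you use for the simulated user; both are
    assumptions for this sketch.
    """
    history = []
    for _ in range(max_turns):
        transcript = "\n".join(f"{m['role']}: {m['text']}" for m in history)
        user_msg = call_llm(USER_SIM_PROMPT.format(persona=persona, history=transcript))
        history.append({"role": "patient", "text": user_msg})
        agent_msg = run_agent(history)  # the response under evaluation
        history.append({"role": "assistant", "text": agent_msg})
    return history  # feed this transcript to a conversation-level judge
```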

Putting it all together

Now that we've explored the different dimensions of an eval system, let's put it all together and look at some real-world examples. We'll discuss how to build a comprehensive eval system that solves various problems and provides valuable insights into your AI agent's performance.

Catching Regressions:

The highest-priority problem you want to solve is catching regressions. To do this, you need an LLM judge for high scalability. You can use it with component-level fixed datasets or conversation-level datasets with an LLM user. This will enable you to quickly identify and address any regressions in your AI agent's performance.

Monitoring Releases

You want to run evals a few times a week to ensure your releases are going well. For this, you need higher reliability at lower scale. Using a benchmark dataset will enable you to compare your metrics to last week's and ensure they look stable. This will give you confidence that your releases are not introducing any new issues.

Reporting and Goal Setting

For reporting and goal setting, conversation-level metrics work best as they tend to be the most representative of user experience. These metrics will provide valuable insights into your AI agent's overall performance and help you set realistic goals for improvement.

Roadmap Creation and Opportunity Finding

Finally, to create your roadmap and find opportunities, component-level data sampled from production is the solution. This type of data will provide detailed insights into specific areas of your AI agent's performance and help you identify areas where improvements can be made.

By considering these three dimensions and building an effective eval system, you can supercharge your AI agents and ensure they're performing optimally. Remember, evals are crucial for catching regressions, monitoring production systems, making quantifiable trade-offs, finding weaknesses, and boosting confidence.
