Paradigm Shifts of Eval in the Age of LLMs
LLMs require some subtle, conceptually simple, yet important changes in the way we think about evaluation
I’ve been building evaluation for ML systems throughout my career. As head of data science at Quora, we built eval for feed ranking, ads, content moderation, etc. My team at Waymo built eval for self-driving cars. Most recently, at our fintech startup Coverbase, we use LLMs to ease the pain of third-party risk management. Drawing from these experiences, I’ve come to recognize that LLMs require some subtle, conceptually simple, yet important changes in the way we think about evaluation.
The goal of this blog post is not to offer specific eval techniques for your LLM application, but rather to suggest these three paradigm shifts:
- Evaluation is the cake, no longer the icing.
- Benchmark the difference.
- Embrace human triage as an integral part of eval.
I should caveat that my discussion is focused on LLM applications, not foundational model development. Also, despite the title, much of what I discuss here is applicable to other generative systems (inspired by my experience in autonomous vehicles), not just LLM applications.
1. Evaluation is the cake, no longer the icing.
Evaluation has always been important in ML development, LLM or not. But I’d argue that it is extra important in LLM development for two reasons:
a) The relative importance of eval goes up, because there are fewer degrees of freedom in building LLM applications, making time spent on non-eval work go down. In LLM development, building on top of foundational models such as OpenAI’s GPT or Anthropic’s Claude models, there are fewer knobs available to tweak in the application layer. And these knobs are much faster to tweak (caveat: faster to tweak, not necessarily faster to get it right). For example, changing the prompt is arguably much faster to implement than writing a new hand-crafted feature for a Gradient-Boosted Decision Tree. Thus, there is less non-eval work to do, making the proportion of time spent on eval go up.
b) The absolute importance of eval goes up, because there are more degrees of freedom in the output of generative AI, making eval a more complex task. In contrast with classification or ranking tasks, generative AI tasks (e.g. write an essay about X, make an image of Y, generate a trajectory for an autonomous vehicle) can have an infinite number of acceptable outputs. Thus, the measurement is a process of projecting a high-dimensional space into lower dimensions. For example, for an LLM task, one can measure: “Is the output text factual?”, “Does the output contain harmful content?”, “Is the language concise?”, “Does it start with ‘certainly!’ too often?”, etc. If precision and recall in a binary classification task are lossless measurements of those binary outputs (measuring what you see), the example metrics I listed earlier for an LLM task are lossy measurements of the output text (measuring a low-dimensional representation of what you see). And that is much harder to get right.
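To make the “lossy projection” concrete, below is a minimal sketch (mine, not from any particular tool) that scores a generated text along a couple of the example dimensions above. The heuristics are deliberately crude stand-ins; in practice, dimensions like factuality or harmfulness would each be their own LLM-as-a-judge call or trained classifier.

```python
# A toy projection of free-form LLM output onto a few low-dimensional metrics.
# These heuristics are crude stand-ins; factuality or harm checks would each
# be an LLM-as-a-judge call or a classifier, omitted here.
from dataclasses import dataclass


@dataclass
class OutputScores:
    concise: bool                # "Is the language concise?"
    starts_with_certainly: bool  # "Does it start with 'certainly!'?"
    word_count: int              # raw length, kept for reference


def score_output(text: str) -> OutputScores:
    words = text.split()
    return OutputScores(
        concise=len(words) <= 150,  # crude proxy for conciseness
        starts_with_certainly=text.strip().lower().startswith("certainly"),
        word_count=len(words),
    )


if __name__ == "__main__":
    print(score_output("Certainly! Here is a short essay about evaluation..."))
```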
This paradigm shift has practical implications for team sizing and hiring when staffing an LLM application project.
2. Benchmark the difference.
This is the dream scenario: we pick a target metric and keep improving on it.
The reality?
You can barely draw more than two consecutive points on the graph!
These might sound familiar to you:
- After the 1st launch, we acquired a much bigger dataset, so the new metric number is no longer an apples-to-apples comparison with the old number. And we can’t re-run the old model on the new dataset: maybe other parts of the system have upgraded and we can’t check out the old commit to reproduce the old model; maybe the eval metric is an LLM-as-a-judge and the dataset is huge, so each eval run is prohibitively expensive; etc.
- After the 2nd launch, we decided to change the output schema. For example, previously we instructed the model to output a yes / no answer; now we instruct the model to output yes / no / maybe / I don’t know. So the previously carefully curated ground truth set is no longer valid.
- After the 3rd launch, we decided to break the single LLM call into a composite of two calls, and now we need to evaluate each sub-component. That requires new datasets for sub-component eval.
- …
The point is that the development cycle in the age of LLMs is often too fast for longitudinal tracking of the same metric.
So what is the solution?
Measure the delta.
In other words, make peace with having just two consecutive points on that graph. The idea is to make sure each model version is better than the previous version (to the best of your knowledge at that point in time), even though it is quite hard to know where its performance stands in absolute terms.
Suppose I have an LLM-based language tutor that first classifies the input as English or Spanish, and then offers grammar tips. A simple metric can be the accuracy of the “English / Spanish” label. Now, say I made some changes to the prompt and want to know whether the new prompt improves accuracy. Instead of hand-labeling a large dataset and computing accuracy on it, another approach is to focus only on the data points where the old and new prompts produce different labels. I won’t be able to know the absolute accuracy of either model this way, but I will know which model has higher accuracy.
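Here is a minimal sketch of that delta comparison. The classify_old, classify_new, and get_human_label callables are hypothetical stand-ins for the two prompt versions and the hand-labeling step; the key point is that human labels are only needed on the disagreement set.

```python
# A minimal sketch of benchmarking the delta (illustrative, not a real tool).
# classify_old / classify_new wrap the two prompt versions of the
# English/Spanish classifier; get_human_label stands in for hand-labeling.
def benchmark_delta(inputs, classify_old, classify_new, get_human_label):
    old_wins, new_wins = 0, 0
    for text in inputs:
        old_label, new_label = classify_old(text), classify_new(text)
        if old_label == new_label:
            continue  # agreement contributes nothing to the delta
        truth = get_human_label(text)  # hand-label only the disagreements
        if old_label == truth:
            old_wins += 1
        elif new_label == truth:
            new_wins += 1
    # Positive => the new prompt is directionally better, even though we
    # never learn either version's absolute accuracy.
    return new_wins - old_wins


if __name__ == "__main__":
    # Toy demo with made-up data and stand-in classifiers.
    data = ["hola amigo", "good morning", "buenos dias", "see you"]
    truth = {"hola amigo": "es", "good morning": "en",
             "buenos dias": "es", "see you": "en"}
    old = lambda t: "en"  # always guesses English
    new = lambda t: "es" if t in ("hola amigo", "buenos dias") else "en"
    print(benchmark_delta(data, old, new, truth.get))  # prints 2
```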
I should clarify that I am not saying benchmarking the absolute has no merit. I am only saying we should be cognizant of the cost of doing so, and that benchmarking the delta, albeit not a full substitute, can be a much more cost-effective way to reach a directional conclusion. One of the more fundamental reasons for this paradigm shift is that if you are building your ML model from scratch, you often have to curate a large training set anyway, so the eval dataset can often be a byproduct of that. This is not the case with zero-shot and few-shot learning on pre-trained models (such as LLMs).
As a second example, perhaps I have an LLM-based metric: we use a separate LLM to judge whether the explanation produced by my LLM language tutor is clear enough. One might ask, “Since the eval is automated now, is benchmarking the delta still cheaper than benchmarking the absolute?” Yes. Because the metric is more complicated now, you will keep improving the metric itself (e.g., by prompt-engineering the LLM-based metric). For one, we still need to eval the eval; benchmarking the deltas tells you whether the new metric version is better. For another, as the LLM-based metric evolves, we don’t have to sweat over backfilling benchmark results of all the old versions of the LLM language tutor with the new LLM-based metric version, as long as we only compare two adjacent versions of the LLM language tutor models.
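For concreteness, such an LLM-based metric might look like the sketch below, assuming the OpenAI Python SDK; the judge prompt and model name are illustrative choices, not a recommendation. The judge prompt is exactly the kind of thing you keep iterating on, which is why the metric needs its own eval.

```python
# A minimal sketch of an LLM-based clarity metric, assuming the OpenAI Python
# SDK (pip install openai) and an OPENAI_API_KEY in the environment. The
# prompt and model are placeholders for illustration.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are grading a language tutor. Answer with exactly 'yes' or 'no': "
    "is the following grammar explanation clear enough for a beginner?\n\n"
    "{explanation}"
)


def is_explanation_clear(explanation: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model would do
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(explanation=explanation)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```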
Benchmarking the deltas can be an effective inner-loop, fast-iteration mechanism, while the more expensive absolute benchmarking or longitudinal tracking is reserved for the outer-loop, lower-cadence iterations.
3. Embrace human triage as an integral part of eval.
As discussed above, the dream of carefully triaging a golden set once and for all, such that it can be used as an evergreen benchmark, is often unattainable. Triaging will be an integral, continuous part of the development process, whether it is triaging the LLM output directly, or triaging the outputs of LLM-as-a-judge or other, more complex metrics. We should continue to make eval as scalable as possible; the point here is that, despite that, we should not expect the elimination of human triage. The sooner we come to terms with this, the sooner we can make the right investments in tooling.
As such, whatever eval tools we use, in-house or not, there should be an easy interface for human triage. A simple interface can look like this: in line with the earlier point on benchmarking the difference, it has a side-by-side panel, and you can easily flip through the results. It should also allow you to easily record your triage notes so that they can be recycled as golden labels for future benchmarking (and hence reduce the future triage load).
A more advanced version would ideally be a blind test, where it is unknown to the triager which side is which. We’ve repeatedly confirmed with data that, without blind testing, developers, even with the best intentions, have a subconscious bias favoring the version they developed.
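To make this tangible, here is a minimal command-line sketch of a blind, side-by-side triage loop. It is illustrative only (the record fields and file name are made up, and a real tool would have a proper UI), but it shows the three ingredients: side-by-side comparison, blind ordering, and notes that get recycled as golden labels.

```python
# A toy blind, side-by-side triage loop. Record fields and the output file
# name are made up for illustration; a real tool would have a proper UI.
import json
import random


def triage(examples, out_path="triage_labels.jsonl"):
    """examples: list of dicts with keys 'input', 'output_a', 'output_b'."""
    with open(out_path, "a") as f:
        for ex in examples:
            sides = [("a", ex["output_a"]), ("b", ex["output_b"])]
            random.shuffle(sides)  # blind: triager doesn't know which is which
            print("\nINPUT:", ex["input"])
            print("LEFT :", sides[0][1])
            print("RIGHT:", sides[1][1])
            choice = input("Better side? [l/r/tie]: ").strip().lower()
            winner = {"l": sides[0][0], "r": sides[1][0]}.get(choice, "tie")
            note = input("Notes (optional): ")
            # Verdicts and notes become golden labels for future benchmarking.
            f.write(json.dumps({"input": ex["input"],
                                "winner": winner,
                                "note": note}) + "\n")


if __name__ == "__main__":
    triage([{"input": "hola, como estas?",
             "output_a": "Tip: Spanish questions open with an inverted question mark.",
             "output_b": "This is Spanish. Grammar looks fine."}])
```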
These three paradigm shifts, once spotted, are fairly straightforward to adapt to. The challenge isn’t in the complexity of the solutions, but in recognizing them upfront amidst the excitement and rapid pace of development. I hope sharing these reflections helps others who are navigating similar challenges in their own work.