July 25, 2024

What Exactly Is an “Eval” and Why Should Product Managers Care?

How to stop worrying and love the data

Generated by the author using Midjourney Version 6

Definition: eval (short for evaluation). A critical phase in a model’s development lifecycle. The process that helps a team understand if an AI model is actually doing what they want it to. The evaluation process applies to all types of models from basic classifiers to LLMs like ChatGPT. The term eval is also used to refer to the dataset or list of test cases used in the evaluation.

Depending on the model, an eval may involve quantitative, qualitative, or human-led assessments, or all of the above. Most evals I’ve encountered in my career have involved running the model on a curated dataset to calculate key metrics of interest, like accuracy, precision, and recall.
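To make that concrete, here is a minimal sketch (not from the article) of computing those metrics for a binary classifier with scikit-learn. The labels are invented purely for illustration; in a real eval they come from your curated dataset and from running the model on it.

```python
# A minimal, illustrative metric calculation for a binary classifier.
# The labels below are made up; in practice they come from the eval dataset
# (ground truth) and from the model's predictions on that dataset.
from sklearn.metrics import accuracy_score, precision_score, recall_score

true_labels      = [1, 0, 1, 1, 0, 0, 1, 0]  # ground truth: 1 = positive class (e.g., "is spam")
predicted_labels = [1, 0, 0, 1, 0, 1, 1, 0]  # the model's predictions for the same examples

print("accuracy: ", accuracy_score(true_labels, predicted_labels))   # share of all predictions that are correct
print("precision:", precision_score(true_labels, predicted_labels))  # of predicted positives, how many are truly positive
print("recall:   ", recall_score(true_labels, predicted_labels))     # of actual positives, how many the model caught
```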

Perhaps because historically evals involved large spreadsheets or databases of numbers, most teams today leave the responsibility of designing and running an eval entirely up to the model developers.

However, I believe in most cases evals should be heavily defined by the product manager.

Image by the author using Midjourney Version 6

Evals aim to answer questions like:

Is this model accomplishing its goal?
Is this model better than other available models?
How will this model impact the user experience?
Is this model ready to be launched in production? If not, what needs work?

Especially for any user-facing models, no one is in a better position than the PM to consider the impact to the user experience and ensure the key user journeys are reflected in the test plan. No one understands the user better than the PM, right?

It’s also the PM’s job to set the goals for the product. It follows that the goal of a model deployed in a product should be closely aligned with the product vision.

But how should you think about setting a “goal” for a model? The short answer is, it depends on what kind of model you are building.

Eval Objectives: One Size Doesn’t Fit All

Setting a goal for a model is a crucial first step before you can design an effective eval. Once we have that, we can ensure we are covering the full range of inputs with our eval composition. Consider the following examples.

Classification

Example model: Classifying emails as spam or not spam.
Product goal: Keep users safe from harm and ensure they can always trust the email service to be a reliable and efficient way to manage all other email communications.
Model goal: Identify as many spam emails as possible while minimizing the number of non-spam emails that are mislabeled as spam.
Goal → eval translation: We want to recreate the corpus of emails the classifier will encounter with our users in our test. We need to make sure to include human-written emails, common spam and phishing emails, and more ambiguous shady marketing emails. Don’t rely exclusively on user reports for your spam labels. Users make mistakes (like thinking a real invitation to be in a Drake music video was spam), and including them will train the model to make these mistakes too.
Eval composition: A list of example emails including legitimate communications, newsletters, promotions, and a range of spam types like phishing, ads, and malicious content. Each example will have a “true” label (e.g., “is spam”) and a predicted label generated during the evaluation. You may also have additional context from the model, like a “probability spam” numerical score. (A coverage check is sketched below.)
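One way to sanity-check that the eval really “recreates the corpus of emails the classifier will encounter” is to compare the category mix of the eval set against the mix you expect in production. A rough sketch, with invented categories and proportions:

```python
# Illustrative check that the eval set's category mix roughly matches the
# traffic mix expected in production. Categories and proportions are made up.
from collections import Counter

eval_categories = [
    "legitimate", "legitimate", "legitimate", "newsletter", "promotion",
    "phishing", "ad_spam", "malicious",
]  # one entry per example in the eval set

expected_mix = {
    "legitimate": 0.50, "newsletter": 0.15, "promotion": 0.10,
    "phishing": 0.10, "ad_spam": 0.10, "malicious": 0.05,
}

counts = Counter(eval_categories)
total = len(eval_categories)
for category, expected_share in expected_mix.items():
    actual_share = counts.get(category, 0) / total
    print(f"{category:>10}: eval {actual_share:.0%} vs expected {expected_share:.0%}")
```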

Text Generation — Task Assistance

Example model: A customer service chatbot for tax return preparation software.
Product goal: Reduce the amount of time it takes users to fill out and submit their tax return by providing quick answers to the most common support questions.
Model goal: Generate accurate answers for questions about the most common scenarios users encounter. Never give incorrect advice. If there is any doubt about the correct response, route the query to a human agent or a help page.
Goal → eval translation: Simulate the range of questions the chatbot is likely to receive, especially the most common, the most challenging, and the most problematic, where a bad answer is disastrous for the user or company.
Eval composition: A list of queries (e.g., “Can I deduct my home office expenses?”) and ideal responses (e.g., from FAQs and experienced customer support agents). When the chatbot shouldn’t give an answer and/or should escalate to an agent, specify this outcome. The queries should cover a range of topics with varying levels of complexity, user emotions, and edge cases. Problematic examples might include “will the government notice if I don’t mention this income?” and “how much longer do you think I will have to keep paying for my father’s home care?” (A sketch of how such cases might be structured follows below.)
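Here is a hedged sketch of how those eval cases might be represented in code. The field names and the ideal-response text are hypothetical, not from the article; the point is that each case captures both the target answer and the expected action (answer vs. escalate).

```python
# Illustrative structure for chatbot eval cases. Field names and response text
# are hypothetical; "expected_action" captures when the bot should answer
# versus route the user to a human agent or help page.
chatbot_eval_cases = [
    {
        "query": "Can I deduct my home office expenses?",
        "ideal_response": "Summarize the home-office deduction rules and link to the relevant help article.",
        "expected_action": "answer",
        "category": "common_deduction_question",
    },
    {
        "query": "Will the government notice if I don't mention this income?",
        "ideal_response": None,  # no direct answer should be generated for this one
        "expected_action": "escalate_to_agent",
        "category": "problematic_query",
    },
]
```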

Recommendation

Example model: Recommendations of baby and toddler products for parents.
Product goal: Simplify essential shopping for families with young children by suggesting stage-appropriate products that evolve to reflect changing needs as their child grows up.
Model goal: Identify the highest-relevance products customers are most likely to buy based on what we know about them.
Goal → eval translation: Try to get a preview of what users will see on day one when the model launches, considering both the most common user experiences and edge cases, and try to anticipate any examples where something could go horribly wrong (like recommending dangerous or illegal products under the banner “for your little one”).
Eval composition: For an offline sense check, you want a human to review the results to see if they are reasonable. The examples could be a list of 100 diverse customer profiles and purchase histories, paired with the top 10 recommended products for each. For your online evaluation, an A/B test will allow you to compare the model’s performance to a simple heuristic (like recommending bestsellers) or to the current model. Running an offline evaluation to predict what people will click using historical click behavior is also an option (sketched below), but getting unbiased evaluation data here can be tricky if you have a large catalog. To learn more about online and offline evaluations check out this article or ask your favorite LLM.
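For the historical-behavior option, one simple offline metric is a hit rate: does the top-10 list contain something the customer actually went on to buy? A minimal sketch, with invented customer and product identifiers:

```python
# Illustrative offline check: fraction of customers whose top-k recommendations
# contain at least one item they actually purchased (held out from training).
def hit_rate_at_k(recommendations: dict[str, list[str]],
                  held_out_purchases: dict[str, set[str]],
                  k: int = 10) -> float:
    hits = 0
    for customer_id, recs in recommendations.items():
        purchased = held_out_purchases.get(customer_id, set())
        if any(item in purchased for item in recs[:k]):
            hits += 1
    return hits / max(len(recommendations), 1)

# Made-up data for two customers.
recs = {"cust_1": ["stroller_a", "bottle_b", "crib_c"],
        "cust_2": ["diapers_x", "toy_y", "monitor_z"]}
held_out = {"cust_1": {"bottle_b"}, "cust_2": {"car_seat_q"}}
print(hit_rate_at_k(recs, held_out))  # 0.5 in this toy example
```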

These are of course simplified examples, and every model has product and data nuances that should be taken into account when designing an eval. If you aren’t sure where to start designing your own eval, I recommend describing the model and goals to your favorite LLM and asking for its advice.

An Eval In Action: Implications for the User Experience

Here’s a (simplified) sample of what an eval data set might look like for an email spam detection model.

Image by the author
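Roughly, such a dataset pairs each email with its ground-truth label, the model’s prediction, and any extra model context. Here is an illustrative slice built with pandas; the rows, labels, and scores are invented, and the columns follow the composition described in the classification example above.

```python
# An invented slice of a spam-detection eval dataset, for illustration only.
# Columns: the email, its category, the ground-truth label, the model's
# predicted label, and the model's "probability spam" score.
import pandas as pd

eval_df = pd.DataFrame([
    {"email_subject": "Your job offer from Acme Corp",   "category": "critical",
     "true_label": "not_spam", "predicted_label": "spam",     "prob_spam": 0.81},
    {"email_subject": "WIN A FREE CRUISE TODAY!!!",      "category": "ad_spam",
     "true_label": "spam",     "predicted_label": "spam",     "prob_spam": 0.99},
    {"email_subject": "Team lunch moved to Thursday",    "category": "legitimate",
     "true_label": "not_spam", "predicted_label": "not_spam", "prob_spam": 0.03},
    {"email_subject": "Verify your account immediately", "category": "phishing",
     "true_label": "spam",     "predicted_label": "not_spam", "prob_spam": 0.45},
])
print(eval_df)
```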

So … where does the PM come in? And why should they be looking at the data?

Imagine the following scenario:

Model developer: “Hey PM. Our new model has 96% accuracy on the evaluation, can we ship it? The current model only got 93%.”

Bad AI PM: “96% is better than 93%. So yes, let’s ship it.”

Better AI PM: “That’s a great improvement! Can I look at the eval data? I’d like to understand how often critical emails are being flagged as spam, and what kind of spam is being let through.”

After spending some time with the data, the better AI PM sees that even though more spam emails are now correctly identified, a meaningful number of critical emails, like the job offer example above, are also being incorrectly labeled as spam. They assess how often this happens and how many users might be impacted. They conclude that even if this only impacted 1% of users, the impact could be catastrophic, and the tradeoff isn’t worth it just to let fewer spam emails through.
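That kind of digging usually means slicing the eval data rather than staring at a single top-line number. Continuing the invented eval_df from the earlier sketch, the slicing might look something like this:

```python
# Slice the illustrative eval_df from the earlier sketch: how often are
# critical emails wrongly flagged as spam, and how much spam slips through?
critical = eval_df[eval_df["category"] == "critical"]
critical_false_positives = critical[
    (critical["true_label"] == "not_spam") & (critical["predicted_label"] == "spam")
]
print("critical emails flagged as spam:", len(critical_false_positives), "of", len(critical))

spam = eval_df[eval_df["true_label"] == "spam"]
missed_spam = spam[spam["predicted_label"] == "not_spam"]
print("spam emails let through:", len(missed_spam), "of", len(spam))
```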

The very best AI PM goes a step further to identify gaps in the training data, like an absence of critical business communication examples. They help source additional data to reduce the rate of false positives. Where model improvements aren’t feasible, they propose changes to the product’s UI, like flagging that an email “might” be spam when the model isn’t certain. This is only possible because they know the data and which real-world examples matter to users.

Remember, AI product management does not require an in-depth knowledge of model architecture. However, being comfortable looking at lots of data examples to understand a model’s impact on your users is vital. Understanding critical edge cases that might otherwise escape evaluation datasets is especially important.

Evals where PM Input Is Less Relevant

The term “eval” really is a catch-all that is used differently by everyone. Not all evals are focused on details relevant to the user experience. Some evals help the dev team predict behavior in production, like latency and cost. While the PM might be a stakeholder for these evals, PM co-design is not critical, and heavy PM involvement might even be a distraction for everyone.

Ultimately the PM should be in charge of making sure ALL the right evals are being developed and run by the right people. PM co-development is most important for any evals related to the user experience.

Eval to Launch — What is Good Enough?

In traditional software engineering, it’s expected that 100% of unit tests pass before any code enters production. Alas, this is not how things work in the world of AI. Evals almost always reveal something less than ideal. So if you can never achieve 100% of what you want, how do you decide a model is ready to ship? Setting this bar with the model developers should also be part of an AI PM’s job.

The PM should determine what eval metrics indicate the model is ‘good enough’ to offer value to users with acceptable tradeoffs.

Your bar for “value” might vary. There are many instances where launching something rough early on to see how users interact with it (and start your data flywheel) can be a great strategy so long as you don’t cause any harm to the users or your brand.

Consider the customer service chatbot.

The bot will never generate answers that perfectly mirror your ideal responses. Instead, a PM could work with the model developers to define a set of heuristics that assess closeness to the ideal answers. This blog post covers some popular heuristics. There are also many open source and paid frameworks that support this part of the evaluation process, with more launching all the time.
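As one generic illustration (not necessarily a heuristic from that post), a very simple closeness check is string similarity between the bot’s answer and the ideal answer, using only the Python standard library. The example answers are hypothetical, and real evals typically combine several signals such as semantic similarity, factuality checks, or LLM-as-judge scoring.

```python
# A deliberately simple closeness heuristic: surface-level string similarity
# between the model's answer and the ideal answer. Illustrative only.
from difflib import SequenceMatcher

def similarity_to_ideal(model_answer: str, ideal_answer: str) -> float:
    """Return a 0-1 similarity ratio between the two answers."""
    return SequenceMatcher(None, model_answer.lower(), ideal_answer.lower()).ratio()

score = similarity_to_ideal(
    "Home office expenses can be deductible if the space is used exclusively for work.",
    "Home office expenses may be deductible when the space is used regularly and exclusively for business.",
)
print(f"similarity: {score:.2f}")  # the PM and dev team would agree on an acceptable threshold
```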

It is also important to estimate the frequency of potentially disastrous responses that could misinform users or hurt the company (ex: offer a free flight!), and work with the model developers on improvements to minimize this frequency. This can also be a good opportunity to connect with your in-house marketing, PR, legal, and security teams.

After a launch, the PM must ensure monitoring is in place so that critical use cases continue to work as expected, AND that future work is directed towards improving any underperforming areas.

Similarly, no production-ready spam email filter achieves 100% precision AND 100% recall (and even if it could, spam techniques will continue to evolve), but understanding where the model fails can inform product accommodations and future model investments.

Recommendation models often require many evals, including online and offline evals, before launching to 100% of users in production. If you are working on a high-stakes surface, you’ll also want a post-launch evaluation to look at the impact on user behavior and identify new examples for your eval set.

Good AI product management isn’t about achieving perfection. It’s about delivering the best product to your users, which requires:

Setting specific goals for how the model will impact the user experience → make sure critical use cases are reflected in the eval
Understanding model limitations and how these impact users → pay attention to issues the eval uncovers and what these would mean for users
Making informed decisions about acceptable trade-offs and a plan for risk mitigation → informed by learnings from the evaluation’s simulated behavior

Embracing evals allows product managers to understand and own the impact of the model on user experience, and effectively lead the team towards better results.

