Dr. Owns

January 20, 2025

How to get from PoCs to tested high-quality applications in production

Image licensed from elements.envato.com, edit by Marcel Müller, 2025

The generative AI hype has rolled through the business world in the past two years. This technology can make business process executions more efficient, reduce wait time, and reduce process defects. Some interfaces like ChatGPT make interacting with an LLM easy and accessible. Anyone with experience using a chat application can effortlessly type a query, and ChatGPT will always generate a response. Yet the quality and suitability for the intended use of your generated content may vary. This is especially true for enterprises that want to use generative AI technology in their business operations.

I have spoken to countless managers and entrepreneurs who failed in their endeavors because they could not get high-quality generative AI applications to production and get reusable results from a non-deterministic model. On the other hand, I have also built more than three dozen AI applications and have realized one common misconception when people think about quality for generative AI applications: They think it is all about how powerful your underlying model is. But this is only 30% of the full story.

There are dozens of techniques, patterns, and architectures that help create impactful LLM-based applications of the quality that businesses desire. Different foundation models, fine-tuned models, architectures with retrieval augmented generation (RAG) and advanced processing pipelines are just the tip of the iceberg.

This article shows how we can qualitatively and quantitatively evaluate generative AI applications in the context of concrete business processes. We will not stop at generic benchmarks but introduce approaches to evaluating applications with generative AI. After a quick analysis of generative AI applications and their business processes, we will look into the following questions:

  • In what context do we need to evaluate generative AI applications to assess their end-to-end quality and utility in enterprise applications?
  • When in the development life cycle of applications with generative AI, do we use different approaches for evaluation, and what are the objectives?
  • How do we use different metrics in isolation and production to select, monitor and improve the quality of generative AI applications?

This overview will give us an end-to-end evaluation framework for generative AI applications in enterprise scenarios that I call PEEL (performance evaluation for enterprise LLM applications). Based on the conceptual framework created in this article, we will introduce an implementation concept as an addition to the entAIngine Test Bed module as part of the entAIngine platform.

1. Background: Business Processes and Generative AI

An organization lives by its business processes. Everything in a company can be a business process, such as customer support, software development, and operations processes. Generative AI can improve our business processes by making them faster and more efficient, reducing wait time and improving the outcome quality of our processes. Yet, we can divide each process activity that uses generative AI even further.

Processes for generative AI applications. © 2025, Marcel Müller

The illustration shows the start of a simple business process that a telecommunications company’s customer support agent must go through. Every time a new customer support request comes in, the customer support agent has to give it a priority level. Once the request reaches the top of their work list, the agent must find the correct answer and write an answer email. Afterward, they need to send the email to the customer and wait for a reply, and they iterate until the request is solved.

We can use a generative AI workflow to make the “find and write answer” activity more efficient. Yet, this activity is often not a single call to ChatGPT or another LLM but a collection of different tasks. In our example, the telco company has built a pipeline using the entAIngine process platform that consists of the following steps.

  • Extract the question and generate a query to the vector database. The example company has a vector database as knowledge for retrieval augmented generation (RAG). We need to extract the essence of the customer’s question from their request email to have the best query and find the sections in the knowledge base that are semantically as close as possible to the question.
  • Find context in the knowledge base. The semantic search activity is the next step in our process. Retrieval-reranking structures are often used to get the top k context chunks relevant to the query and sort them with an LLM. This step aims to retrieve the correct context information to generate the best answer possible.
  • Use context to generate an answer. This step orchestrates a large language model using a prompt and the selected context as input to the prompt.
  • Write an answer email. The final step transforms the pre-formulated answer into a formal email with the correct intro and ending to the message in the company’s desired tone and complexity.
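The four steps above can be sketched as a plain function. This is a minimal illustration, not the entAIngine implementation: the names `answer_support_request`, `LLM` and `Retriever` are hypothetical, the prompts are placeholders, and the model and the vector search are passed in as ordinary callables so that any provider could be plugged in.

```python
from typing import Callable, List

# Hypothetical type: any function that maps a prompt to a completion.
LLM = Callable[[str], str]
# Hypothetical type: any function that returns the top-k chunks for a query.
Retriever = Callable[[str, int], List[str]]


def answer_support_request(email: str, llm: LLM, retrieve: Retriever, k: int = 3) -> str:
    """Run the four pipeline steps in sequence and return the final email."""
    # Step 1: extract the essence of the customer's question.
    question = llm(f"Extract the single core question from this email:\n{email}")
    # Step 2: find context in the knowledge base (top-k chunks).
    chunks = retrieve(question, k)
    # Step 3: generate an answer from the selected context only.
    context = "\n".join(chunks)
    draft = llm(f"Answer using only the context.\nQuestion: {question}\nContext:\n{context}")
    # Step 4: turn the draft into a formal email in the company's tone.
    return llm(f"Rewrite this answer as a formal support email:\n{draft}")


# Stand-ins for a real model and a real vector store (assumptions for the demo).
def fake_llm(prompt: str) -> str:
    return prompt.strip().splitlines()[-1]  # echoes the last line of the prompt


def fake_retrieve(query: str, k: int) -> List[str]:
    manual = ["Hold the reset button for 3 seconds.", "Connect the power cable."]
    return manual[:k]


reply = answer_support_request(
    "Dear customer support,\nHow do I reset my router?", fake_llm, fake_retrieve, k=1
)
```

Because every step is a separate call, each one can later be evaluated on its own or as part of the whole chain.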

The execution of processes like this is called the orchestration of an advanced LLM workflow. There are dozens of other orchestration architectures in enterprise contexts. Using a chat interface that uses the current prompt and the chat history is also a simple type of orchestration. Yet, for reproducible enterprise workflows with sensitive company data, using a simple chat orchestration is not enough in many cases, and advanced workflows like those shown above are needed.

Thus, when we evaluate complex processes for generative AI orchestrations in enterprise scenarios, looking purely at the capabilities of a foundational (or fine-tuned) model is, in many cases, just the start. The following section will dive deeper into what context and orchestration we need to evaluate generative AI applications.

2. Concept

The following sections introduce the core concepts for our approach.

My team has built the entAIngine platform, which is quite unique in that it enables low-code generation of applications with generative AI tasks that are not necessarily chatbot applications. We have also implemented the following approach on entAIngine. If you want to try it out, message me. Or, if you want to build your own testbed functionality, feel free to get inspiration from the concept below.

2.1. Context and Orchestration of Performance Evaluation for Generative AI Applications

When evaluating the performance of generative AI applications in their orchestrations, we have the following choices: We can evaluate a foundational model in isolation, a fine-tuned model or either of those options as part of a larger orchestration, including several calls to different models and RAG. This has the following implications.

Context and orchestration for LLM-based applications. © Marcel Müller, 2025

Publicly available generative AI models like (for LLMs) GPT-4o, Llama 3.2 and many others were trained on the “public wisdom of the internet.” Their training sets included a large corpus of knowledge from books, world literature, Wikipedia articles, and other Internet crawls from forums and blog posts. There is no company-internal knowledge encoded in foundational models. Thus, when we evaluate a foundational model in isolation, we can only evaluate its general capabilities in answering queries. We cannot judge the extent of its company-specific knowledge, that is, “how much the model knows” about the company. Company-specific knowledge only enters the answers of a foundational model through advanced orchestration that inserts company-specific context.

For example, with a free account from ChatGPT, anyone can ask, “How did Goethe die?” The model will provide an answer because the key information about Goethe’s life and death is in the model’s knowledge base. Yet, the question “How much revenue did our company make last year in Q3 in EMEA?” will most likely lead to a heavily hallucinated answer which will seem plausible to inexperienced users. However, we can still evaluate the form and representation of the answers, including style and tone, as well as language capabilities and skills concerning reasoning and logical deduction. Synthetic benchmarks such as ARC, HellaSwag, and MMLU provide comparative metrics for those dimensions. We will take a deeper look into those benchmarks in a later section.

Fine-tuned models build on foundational models. They use additional data sets to add foundational knowledge into a model that has not been there before by further training of the underlying machine learning model. Fine-tuned models have more context-specific knowledge. Suppose we orchestrate them in isolation without any other ingested data. In that case, we can evaluate the knowledge base concerning its suitability for real-world scenarios in a given business process. Fine-tuning is often used to focus on adding domain-specific vocabulary and sentence structures to a foundational model.

Suppose we fine-tune a model on a corpus of legal court rulings. In that case, the fine-tuned model will start using the vocabulary and reproducing the sentence structure that is common in the legal domain. The model can combine some excerpts from old cases but fails to quote the right sources.

Orchestrating foundational models or fine-tuned models with retrieval augmented generation (RAG) produces highly context-dependent results. However, this also requires a more complex orchestration pipeline.

For example, a telco company, like in our example above, can use a language model to create embeddings of their customer support knowledge base and store them in a vector store. We can now efficiently query this knowledge base in a vector store with semantic search. By keeping track of the text segments that are retrieved, we can very precisely show the source of the retrieved text chunk and use it as context in a call to a large language model. This lets us answer our question end-to-end.
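The retrieval idea can be illustrated with a toy example. This is a sketch under strong assumptions: `embed` is a simple word-count stand-in for a real embedding model, the in-memory list replaces a real vector store, and the chunk texts and source labels are invented for illustration. What it shows is that each hit carries its source along with its similarity score.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical knowledge base: (source label, text chunk) pairs.
KNOWLEDGE_BASE = [
    ("AR83 manual, chunk 1", "To reset the router, hold the reset button for 3 seconds."),
    ("AR83 manual, chunk 2", "Connect the power cable and wait for the green light."),
    ("BD77 manual, chunk 1", "The warranty covers hardware defects for two years."),
]

# "Vector store": every chunk is stored together with its embedding and source.
INDEX = [(source, text, embed(text)) for source, text in KNOWLEDGE_BASE]


def search(query: str, k: int = 2) -> List[Tuple[float, str, str]]:
    """Return the k most similar chunks as (score, source, text)."""
    q = embed(query)
    scored = [(cosine(q, vec), source, text) for source, text, vec in INDEX]
    return sorted(scored, key=lambda item: item[0], reverse=True)[:k]


hits = search("How do I reset my router?", k=1)
score, source, text = hits[0]
```

Keeping the source label next to each chunk is what later allows the answer to cite where its context came from.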

We can evaluate how well our application serves its intended purpose end-to-end for such large orchestrations with different data processing pipeline steps.

Evaluating those different types of setups gives us different insights that we can use in the development process of generative AI applications. We will look deeper into this aspect in the next section.

2.2 Evaluation of Generative AI Applications in the Development Lifecycle

We develop generative AI applications in different stages: 1) before building, 2) during build and testing, and 3) in production. With an agile approach, these stages are not executed in a linear sequence but iteratively. Yet, the goals and methods of evaluation in the different stages remain the same regardless of their order.

Before building, we need to evaluate which foundational model to choose or whether to create a new one from scratch. Therefore, we must first define our expectations and requirements, especially w.r.t. execution time, efficiency, price and quality. Currently, only very few companies decide to build their own foundational models from scratch due to cost and updating efforts. Fine-tuning and retrieval augmented generation are the standard tools to build highly personalized pipelines with traceable internal knowledge that leads to reproducible outputs. In this stage, synthetic benchmarks are the go-to approaches to achieve comparability. For example, if we want to build an application that helps lawyers prepare their cases, we need a model that is good at logical argumentation and understanding of a specific language.

During building, our evaluation needs to focus on satisfying the quality and performance requirements of the application’s example cases. In the case of building an application for lawyers, we need to make a limited, representative selection of old cases. Those cases are the basis for defining the standard scenarios against which we implement the application. For example, if the lawyer specializes in financial law and taxation, we would select a few of the standard cases that this lawyer typically handles and create scenarios from them. Every building and evaluation activity that we do in this phase has a limited view of representative scenarios and does not cover every instance. Yet, we need to evaluate the scenarios in the ongoing steps of application development.

In production, our evaluation approach focuses on quantitatively evaluating the real-world usage of our application with the expectations of live users. In production, we will find scenarios that are not covered in our building scenarios. The goal of the evaluation in this phase is to discover those scenarios and gather feedback from live users to improve the application further.

The production phase should always feed back into the development phase to improve the application iteratively. Hence, the three phases are not in a linear sequence, but interleaving.

2.3. Benchmark Metrics for Evaluation

With the “what” and “when” of the evaluation covered, we have to ask “how” we are going to evaluate our generative AI applications. For this, we have three different methods: synthetic benchmarks, limited scenarios and feedback loop evaluation in production.

For synthetic benchmarks, we will look into the most commonly used approaches and compare them.

The AI2 Reasoning Challenge (ARC) tests an LLM’s knowledge and reasoning using a dataset of 7787 multiple-choice science questions. These questions range from 3rd to 9th grade and are divided into Easy and Challenge sets. ARC is useful for evaluating diverse knowledge types and pushing models to integrate information from multiple sentences. Its main benefit is comprehensive reasoning assessment, but it’s limited to scientific questions.

HellaSwag tests commonsense reasoning and natural language inference through sentence completion exercises based on real-world scenarios. Each exercise includes a video caption context and four possible endings. This benchmark measures an LLM’s understanding of everyday scenarios. Its main benefit is the complexity added by adversarial filtering, but it primarily focuses on general knowledge, limiting specialized domain testing.

The MMLU (Massive Multitask Language Understanding) benchmark measures an LLM’s natural language understanding across 57 tasks covering various subjects, from STEM to humanities. It includes 15,908 questions from elementary to advanced levels. MMLU is ideal for comprehensive knowledge assessment. Its broad coverage helps identify deficiencies, but limited construction details and errors may affect reliability.

TruthfulQA evaluates an LLM’s ability to generate truthful answers, addressing hallucinations in language models. It measures how accurately an LLM can respond, especially when training data is insufficient or low quality. This benchmark is useful for assessing accuracy and truthfulness, with the main benefit of focusing on factually correct answers. However, its general knowledge dataset may not reflect truthfulness in specialized domains.

The RAGAS framework is designed to evaluate Retrieval Augmented Generation (RAG) pipelines. It is especially useful for the category of LLM applications that utilize external data to enhance the LLM’s context. The framework introduces metrics for faithfulness, answer relevancy, context recall, context precision, context relevancy, context entity recall and summarization score, which give a differentiated view of the quality of the retrieved outputs.

WinoGrande tests an LLM’s commonsense reasoning through pronoun resolution problems based on the Winograd Schema Challenge. It presents near-identical sentences with different answers based on a trigger word. This benchmark is beneficial for resolving ambiguities in pronoun references, featuring a large dataset and reduced bias. However, annotation artifacts remain a limitation.

The GSM8K benchmark measures an LLM’s multi-step mathematical reasoning using around 8,500 grade-school-level math problems. Each problem requires multiple steps involving basic arithmetic operations. This benchmark highlights weaknesses in mathematical reasoning, featuring diverse problem framing. However, the simplicity of problems may limit their long-term relevance.

SuperGLUE enhances the GLUE benchmark by testing an LLM’s NLU capabilities across eight diverse subtasks, including Boolean Questions and the Winograd Schema Challenge. It provides a thorough assessment of linguistic and commonsense knowledge. SuperGLUE is ideal for broad NLU evaluation, with comprehensive tasks offering detailed insights. However, fewer models are tested on it than on benchmarks such as MMLU.

HumanEval measures an LLM’s ability to generate functionally correct code through coding challenges and unit tests. It includes 164 coding problems with several unit tests per problem. This benchmark assesses coding and problem-solving capabilities, focusing on functional correctness similar to human evaluation. However, it only covers some practical coding tasks, limiting its comprehensiveness.

MT-Bench evaluates an LLM’s capability in multi-turn dialogues by simulating real-life conversational scenarios. It measures how effectively chatbots engage in conversations, following a natural dialogue flow. With a carefully curated dataset, MT-Bench is useful for assessing conversational abilities. However, its small dataset and the difficulty of simulating real conversations remain limitations.

All those metrics are synthetic and aim to provide a relative comparison between different LLMs. However, their concrete impact for a use case in a company depends on how well the challenge in the scenario maps to the benchmark. For example, in use cases for tax accountants where a lot of math is needed, GSM8K would be a good candidate to evaluate that capability. HumanEval is the initial tool of choice for the use of an LLM in a coding-related scenario.

2.4. Real-life Scenario-based Evaluation

However, the impact of those benchmarks is rather abstract and only gives an indication of their performance in an enterprise use case. This is where working with real-life scenarios is needed.

Real-life scenarios consist of the following components:

  • case-specific context data (input),
  • case-independent context data,
  • a sequence of tasks to complete and
  • the expected output.

With real-life test scenarios, we can model different situations, like

  • multi-step chat interactions with several questions and answers,
  • complex automation tasks with multiple AI interactions,
  • processes that involve RAG and
  • multi-modal process interactions.

In other words, it does not help anyone to have the best model in the world if the RAG pipeline always returns mediocre results because your chunking strategy is not good. Also, if you do not have the right data to answer your queries, you will always get some hallucinations that may or may not be close to the truth. In the same way, your results will vary based on the hyperparameters of your chosen models (temperature, frequency penalty, etc.). And we cannot use the most powerful model for every use case, if this is an expensive model.

Standard benchmarks focus on the individual models rather than on the big picture. That is why we introduce the PEEL framework for performance evaluation of enterprise LLM applications, which gives us an end-to-end view.

The core concept of PEEL is the evaluation scenario. We distinguish between an evaluation scenario definition and an evaluation scenario execution. The conceptual illustration shows the overall concepts in black, an example definition in blue and the outcome of one instance of an execution in green.

The concept of evaluation scenarios as introduced by the PEEL framework © Marcel Müller

An evaluation scenario definition consists of input definitions, an orchestration definition and an expected output definition.

For the input, we distinguish between case-specific and case-independent context data. Case-specific context data changes from case to case. For example, in the customer support use case, the question that a customer asks is different from customer case to customer case. In our example evaluation execution, we depicted one case where the email inquiry reads as follows:

“Dear customer support,

my name is […]. How do I reset my router when I move to a different apartment?

Kind regards, […]”

Yet, the knowledge base, where the answers to the question are located in large documents, is case-independent. In our example, we have a knowledge base with the PDF manuals for the routers AR83, AR93, AR94 and BD77 stored in a vector store.

An evaluation scenario definition has an orchestration. An orchestration consists of a series of n >= 1 steps that are executed in sequence during the evaluation scenario execution. Each step has inputs that it takes from any of the previous steps or from the input to the scenario execution. Steps can be interactions with LLMs (or other models), context retrieval tasks (for example, from a vector database) or other calls to data sources. For each step, we distinguish between the prompt / request and the execution parameters. The execution parameters include the model or method that needs to be executed and its hyperparameters. The prompt / request is a collection of different static or dynamic data pieces that get concatenated (see illustration).

In our example, we have a three-step orchestration. In step 1, we extract a single question from the case-specific input context (the customer’s email inquiry). We use this question in step 2 to create a semantic search query in our vector database using the cosine similarity metric. The last step takes the search results and formulates an email using an LLM.
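One possible way to represent such a definition in code is sketched below. The class and field names (`Step`, `ScenarioDefinition`, `execute`) are hypothetical and not taken from entAIngine, and the three steps are stubbed with lambdas instead of real model or vector database calls. Each step fills its prompt template from the scenario input or from the outputs of earlier steps.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Step:
    name: str                                   # key under which the output is stored
    prompt_template: str                        # static text with {placeholders}
    run: Callable[[str, Dict[str, object]], str]  # model or method to execute
    params: Dict[str, object] = field(default_factory=dict)  # hyperparameters


@dataclass
class ScenarioDefinition:
    steps: List[Step]
    expected_output: str


def execute(scenario: ScenarioDefinition, inputs: Dict[str, str]) -> Dict[str, str]:
    """Execute all steps in sequence; later steps can read earlier outputs."""
    if not scenario.steps:
        raise ValueError("an orchestration needs at least one step (n >= 1)")
    results = dict(inputs)
    for step in scenario.steps:
        request = step.prompt_template.format(**results)  # concatenate data pieces
        results[step.name] = step.run(request, step.params)
    return results


# Stubbed three-step orchestration (assumption: real steps would call an LLM
# and a vector database instead of these lambdas).
scenario = ScenarioDefinition(
    steps=[
        Step("question", "Extract the question: {email}",
             lambda req, p: req.split(": ", 1)[1], {"temperature": 0.0}),
        Step("context", "{question}",
             lambda req, p: "Hold the reset button for 3 seconds.", {"top_k": 1}),
        Step("answer", "Context: {context} Question: {question}",
             lambda req, p: f"Dear customer, {req}"),
    ],
    expected_output="reset button",
)

out = execute(scenario, {"email": "How do I reset my router?"})
```

Storing every intermediate output by step name makes it possible to inspect where in the chain a bad final answer originated.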

In an evaluation scenario definition, we have an expected output and an evaluation method. Here, we define for every scenario how we want to evaluate the actual outcome vs. the expected outcome. We have the following options:

  • Exact match/regex match: We check for the occurrence of a specific series of terms/concepts and give as an answer a boolean where 0 means that the defined terms did not appear in the output of the execution and 1 means they did appear. For example, the core concept of installing a router at a new location is pressing the reset button for 3 seconds. If the terms “reset button” and “3 seconds” are not part of the answer, we would evaluate it as a failure.
  • Semantic match: We check if the text is semantically close to what our expected answer is. Therefore, we use an LLM and task it to judge with a rational number between 0 and 1 how well the answer matches the expected answer.
  • Manual match: Humans evaluate the output on a scale between 0 and 1.
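The first two evaluation methods could look roughly like the following sketch. The function names are hypothetical. For the semantic match, the LLM judge is passed in as a callable (`judge`) and is assumed to reply with a bare number; a real judge would need a more robust prompt and reply parsing.

```python
import re
from typing import Callable, Iterable


def regex_match(output: str, patterns: Iterable[str]) -> int:
    """Return 1 if every pattern occurs in the output, otherwise 0."""
    return int(all(re.search(p, output, flags=re.IGNORECASE) for p in patterns))


def semantic_match(output: str, expected: str, judge: Callable[[str], str]) -> float:
    """Ask an LLM judge for a score and clamp it to the range [0, 1]."""
    prompt = (
        "On a scale from 0 to 1, how well does the answer match the expected answer? "
        "Reply with the number only.\n"
        f"Expected: {expected}\nAnswer: {output}"
    )
    score = float(judge(prompt).strip())  # raises ValueError on a non-numeric reply
    return min(1.0, max(0.0, score))


required = ["reset button", r"3 seconds"]
passed = regex_match("Please press the reset button for 3 seconds.", required)
failed = regex_match("Please unplug the router.", required)

# Stub judge for illustration; a real one would call an LLM.
semantic = semantic_match("Press reset for 3 s.", "Hold the reset button for 3 seconds.",
                          judge=lambda prompt: " 0.8 ")
```

All three methods return a number between 0 and 1, so their results can be aggregated in the same way.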

An evaluation scenario should be executed many times because LLMs are non-deterministic models. We want to have a reasonable number of executions so we can aggregate the scores and have a statistically significant output.

The benefit of using such scenarios is that we can use them while building and debugging our orchestrations. When we see that 80 out of 100 executions of the same prompt score less than 0.3, we use this input to tweak our prompts or to add other data to our fine-tuning before orchestration.
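Aggregating the scores of repeated executions can be as simple as the following sketch. The function name `aggregate` and the choice of summary statistics are assumptions; the threshold of 0.3 and the 80-out-of-100 case mirror the example in the text.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def aggregate(scores: Sequence[float], threshold: float = 0.3) -> Dict[str, float]:
    """Summarize the scores of repeated executions of one scenario."""
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one execution")
    return {
        "n": n,
        "mean": mean(scores),
        "stdev": stdev(scores) if n > 1 else 0.0,
        # Share of executions that scored below the threshold.
        "share_below_threshold": sum(1 for s in scores if s < threshold) / n,
    }


# Illustrative data: 80 poor executions and 20 good ones.
stats = aggregate([0.2] * 80 + [0.9] * 20)
```

A high share of low scores, as in this illustrative data, signals that the prompt or the data behind it needs rework rather than a single unlucky run.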

2.5. Feedback Collection and Adjustment in Production

The principle for collecting feedback in production is analogous to the scenario approach. We map each user interaction to a scenario. If the user has larger degrees of freedom of interaction, we might need to create new scenarios that we did not anticipate during the building phase.

The user gets a slider between 0 and 1, where they can indicate how satisfied they were with the output of a result. From a user experience perspective, this number can also be simplified into different media, for example, a laughing, neutral and sad smiley. Thus, this evaluation is the manual match method that we introduced before.

In production, we have to create the same aggregations and metrics as before, just with live users and a potentially larger amount of data.
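A minimal sketch of how live feedback could be normalized and aggregated per scenario follows. The linear mapping of stars and smileys to the range 0 to 1 is an assumption, as are the scenario identifiers and function names.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

# Assumed mapping of the simplified feedback widget to the 0..1 scale.
SMILEY_TO_SCORE = {"sad": 0.0, "neutral": 0.5, "happy": 1.0}


def stars_to_score(stars: int, max_stars: int = 5) -> float:
    """Map a star rating linearly onto 0..1 (1 star -> 0.0, max_stars -> 1.0)."""
    if max_stars < 2 or not 1 <= stars <= max_stars:
        raise ValueError("stars must be between 1 and max_stars")
    return (stars - 1) / (max_stars - 1)


def mean_score_per_scenario(events: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Group (scenario_id, score) feedback events and average them per scenario."""
    grouped = defaultdict(list)
    for scenario_id, score in events:
        grouped[scenario_id].append(score)
    return {scenario_id: mean(scores) for scenario_id, scores in grouped.items()}


# Hypothetical feedback events collected from live users.
events = [
    ("router-reset", SMILEY_TO_SCORE["happy"]),
    ("router-reset", stars_to_score(3)),
    ("billing", SMILEY_TO_SCORE["sad"]),
]
summary = mean_score_per_scenario(events)
```

Scenarios with a low average are candidates for new or refined test cases in the building phase.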

3. Example Implementation as Part of entAIngine Test Bed

Together with the entAIngine team, we have implemented the functionality on the platform. This section shows you how things could be done and is meant to give you inspiration. Or, if you want to use what we have implemented, feel free to.

We map our concepts of evaluation scenario definitions and evaluation scenario executions to classic concepts of software testing. The starting point for any interaction to create a new test is the entAIngine application dashboard.

entAIngine dashboard © Marcel Müller

In entAIngine, users can create many different applications. Each of the applications is a set of processes that define workflows in a no-code interface. Processes consist of input templates (variables), RAG components, calls to LLMs, TTS, image and audio modules, integration to documents and OCR. With these components, we build reusable processes that can be integrated via an API, used as chat flows, used in a text editor as a dynamic text-generating block, or in a knowledge management search interface that shows the sources of answers. This functionality is already completely implemented in the entAIngine platform and can be used as SaaS or deployed 100% on-premise. It integrates with existing gateways, data sources and models via API. We will use the process template generator to create evaluation scenario definitions.

When the user wants to create a new test, they go to “test bed” and “tests”.

On the tests screen, the user can create new evaluation scenarios or edit existing ones. When creating a new evaluation scenario, the orchestration (an entAIngine process template) and a set of metrics must be defined. We assume we have a customer support scenario where we need to retrieve data with RAG to answer a question in the first step and then formulate an answer email in the second step. Then, we use the new module to name the test, define / select a process template and pick an evaluator that will create a score for every individual test case.

Test definition © Marcel Müller, 2025
Test case (process template) definition © Marcel Müller, 2025

The metrics are as defined above: regex match, semantic match and manual match. The screen with the process definition already exists and is functional, together with the orchestration. The functionality to define tests in bulk as seen below is new.

Test and test cases © Marcel Müller, 2025

In the test editor, we work on an evaluation scenario definition (“evaluate how good our customer support answering RAG is”), and we define different test cases in this scenario. A test case assigns data values to the variables in the test. We can try 50 or 100 different instances of test cases and evaluate and aggregate them. For example, if we evaluate our customer support answering, we can define 100 different customer support requests, define our expected outcome and then execute them and analyze how good the answers were. Once we have designed a set of test cases, we can execute their scenarios with the right variables using the existing orchestration engine and evaluate them.
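The relationship between a test, its process template and its test cases can be sketched as follows. This is an illustration and not the entAIngine data model: the template text, the variable names and the stubbed process are assumptions. Each test case binds values to the template variables and carries its own expected terms.

```python
from string import Template
from typing import Callable, Dict, List

# Hypothetical process template with two variables.
PROCESS_TEMPLATE = Template("Customer owns router $model and asks: $question")

# Each test case assigns values to the variables and defines the expected outcome.
TEST_CASES: List[Dict[str, object]] = [
    {"variables": {"model": "AR83", "question": "How do I reset my router?"},
     "expected_terms": ["reset button", "3 seconds"]},
    {"variables": {"model": "BD77", "question": "How do I connect the power?"},
     "expected_terms": ["power cable"]},
]


def run_test(process: Callable[[str], str], cases: List[Dict[str, object]]) -> List[int]:
    """Execute every test case and score it with a simple term match (0 or 1)."""
    scores = []
    for case in cases:
        prompt = PROCESS_TEMPLATE.substitute(case["variables"])
        output = process(prompt).lower()
        scores.append(int(all(term in output for term in case["expected_terms"])))
    return scores


def stub_process(prompt: str) -> str:
    # Stand-in for the orchestration engine; always returns the same answer.
    return "Hold the reset button for 3 seconds."


scores = run_test(stub_process, TEST_CASES)
pass_rate = sum(scores) / len(scores)
```

With 50 or 100 such cases, the per-case scores give a pass rate for the whole test as well as a list of concrete failing inputs to debug.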

Metrics and evaluation © Marcel Müller, 2025

This testing is happening during the building phase. We have an additional screen that we use to evaluate real user feedback in the productive phase. The contents are collected from real user feedback (through our engine and API).

The metrics that we have available in the live feedback section are collected from a user through a star rating.

Conclusion: Testing and Quality

In this article, we have looked into advanced testing and quality engineering concepts for generative AI applications, especially those that are more complex than simple chatbots. The introduced PEEL framework is a new approach for scenario-based testing that is closer to the implementation level than the generic benchmarks with which we test models. For good applications, it is important to test the model not only in isolation but also in orchestration.

Get in touch with me

I work day to day on real-world applications with generative AI, especially in the enterprise. If you want to connect, feel free to add me or send a message on LinkedIn.


Why Generative-AI Apps’ Quality Often Sucks and What to Do About It was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

​How to get from PoCs to tested high-quality applications in productionImage licensed from elements.envato.com, edit by Marcel Müller, 2025The generative AI hype has rolled through the business world in the past two years. This technology can make business process executions more efficient, reduce wait time, and reduce process defects. Some interfaces like ChatGPT make interacting with an LLM easy and accessible. Anyone with experience using a chat application can effortlessly type a query, and ChatGPT will always generate a response. Yet the quality and suitability for the intended use of your generated content may vary. This is especially true for enterprises that want to use generative AI technology in their business operations.I have spoken to countless managers and entrepreneurs who failed in their endeavors because they could not get high-quality generative AI applications to production and get reusable results from a non-deterministic model. On the other hand, I have also built more than three dozen AI applications and have realized one common misconception when people think about quality for generative AI applications: They think it is all about how powerful your underlying model is. But this is only 30% of the full story.But there are dozens of techniques, patterns, and architectures that help create impactful LLM-based applications of the quality that businesses desire. Different foundation models, fine-tuned models, architectures with retrieval augmented generation (RAG) and advanced processing pipelines are just the tip of the iceberg.This article shows how we can qualitatively and quantitatively evaluate generative AI applications in the context of concrete business processes. We will not stop at generic benchmarks but introduce approaches to evaluating applications with generative AI. 
After a quick analysis of generative AI applications and their business processes, we will look into the following questions:In what context do we need to evaluate generative AI applications to assess their end-to-end quality and utility in enterprise applications?When in the development life cycle of applications with generative AI, do we use different approaches for evaluation, and what are the objectives?How do we use different metrics in isolation and production to select, monitor and improve the quality of generative AI applications?This overview will give us an end-to-end evaluation framework for generative AI applications in enterprise scenarios that I call the PEEL (performance evaluation for enterprise LLM applications). Based on the conceptual framework created in this article, we will introduce an implementation concept as an addition to the entAIngine Test Bed module as part of the entAIngine platform.1. Background: Business Processes and Generative AIAn organization lives by its business processes. Everything in a company can be a business process, such as customer support, software development, and operations processes. Generative AI can improve our business processes by making them faster and more efficient, reducing wait time and improving the outcome quality of our processes. Yet, we can further divide each process activity that uses generative AI even more.Processes for generative AI applications. © 2025, Marcel MüllerThe illustration shows the start of a simple business that a telecommunications company’s customer support agent must go through. Every time a new customer support request comes in, the customer support agent has to give it a priority-level. When the work items on their list come to the point that the request has priority, the customer support agents must find the correct answer and write an answer email. 
Afterward, they need to send the email to the customers and wait for a reply, and they iterate until the request is solved.

We can use a generative AI workflow to make the “find and write answer” activity more efficient. Yet, this activity is often not a single call to ChatGPT or another LLM but a collection of different tasks. In our example, the telco company has built a pipeline using the entAIngine process platform that consists of the following steps.

  • Extract the question and generate a query to the vector database. The example company has a vector database as knowledge for retrieval augmented generation (RAG). We need to extract the essence of the customer’s question from their request email to have the best query and find the sections in the knowledge base that are semantically as close as possible to the question.
  • Find context in the knowledge base. The semantic search activity is the next step in our process. Retrieval-reranking structures are often used to get the top k context chunks relevant to the query and sort them with an LLM. This step aims to retrieve the correct context information to generate the best answer possible.
  • Use context to generate an answer. This step orchestrates a large language model using a prompt and the selected context as input to the prompt.
  • Write an answer email. The final step transforms the pre-formulated answer into a formal email with the correct intro and ending to the message in the company’s desired tone and complexity.

The execution of processes like this is called the orchestration of an advanced LLM workflow. There are dozens of other orchestration architectures in enterprise contexts. Using a chat interface that uses the current prompt and the chat history is also a simple type of orchestration.
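
The four steps listed above can be sketched in a few lines of Python. This is only an illustration under loud assumptions, not entAIngine code: the three-entry `KNOWLEDGE_BASE`, the bag-of-words `embed` function, and the `fake_llm` stub are hypothetical stand-ins for a real vector store, embedding model, and LLM, chosen so that the example runs without any external service.

```python
import math
import re
from collections import Counter
from typing import Callable

# Hypothetical in-memory knowledge base; a real setup would use a vector store.
KNOWLEDGE_BASE = [
    "To reset the router, hold the reset button for 3 seconds.",
    "Invoices are sent by email at the start of each month.",
    "The router manual covers wall mounting and cabling.",
]

def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (stand-in for an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def fake_llm(prompt: str) -> str:
    """Placeholder for a real LLM call: returns the first context line."""
    return prompt.split("Context:\n", 1)[1].splitlines()[0]

def extract_question(email: str) -> str:
    # Step 1: a real pipeline would prompt an LLM; here we take the
    # first sentence that ends with a question mark.
    sentences = re.split(r"(?<=[.?!])\s+", email.strip())
    return next((s for s in sentences if s.endswith("?")), email.strip())

def retrieve(query: str, k: int = 2) -> list[str]:
    # Step 2: rank all chunks by cosine similarity and keep the top k.
    q = embed(query)
    ranked = sorted(KNOWLEDGE_BASE, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def generate_answer(question: str, context: list[str],
                    llm: Callable[[str], str]) -> str:
    # Step 3: put the retrieved context and the question into one prompt.
    prompt = "Context:\n" + "\n".join(context) + f"\nQuestion: {question}\nAnswer:"
    return llm(prompt)

def write_email(answer: str) -> str:
    # Step 4: simplified to a fixed template instead of a second LLM call.
    return f"Dear customer,\n\n{answer}\n\nKind regards,\nCustomer Support"

def run_pipeline(email: str, llm: Callable[[str], str] = fake_llm) -> str:
    question = extract_question(email)
    context = retrieve(question)
    return write_email(generate_answer(question, context, llm))

print(run_pipeline(
    "Hello. How do I reset my router when I move to a different apartment? Kind regards"
))
```

Passing the LLM in as a callable keeps each step replaceable, which matters later when individual steps need to be evaluated or swapped.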
Yet, for reproducible enterprise workflows with sensitive company data, using a simple chat orchestration is not enough in many cases, and advanced workflows like those shown above are needed.

Thus, when we evaluate complex processes for generative AI orchestrations in enterprise scenarios, looking purely at the capabilities of a foundational (or fine-tuned) model is, in many cases, just the start. The following section will dive deeper into what context and orchestration we need to evaluate generative AI applications.

2. Concept

The following sections introduce the core concepts for our approach.

My team has built the entAIngine platform, which is quite unique in that it enables low-code generation of applications with generative AI tasks that are not necessarily chatbot applications. We have also implemented the following approach on entAIngine. If you want to try it out, message me. Or, if you want to build your own testbed functionality, feel free to get inspiration from the concept below.

2.1. Context and Orchestration of Performance Evaluation for Generative AI Applications

When evaluating the performance of generative AI applications in their orchestrations, we have the following choices: We can evaluate a foundational model in isolation, a fine-tuned model in isolation, or either of those options as part of a larger orchestration, including several calls to different models and RAG. This has the following implications.

Context and orchestration for LLM-based applications. © Marcel Müller, 2025

Publicly available generative AI models like (for LLMs) GPT-4o, Llama 3.2 and many others were trained on the “public wisdom of the internet.” Their training sets included a large corpus of knowledge from books, world literature, Wikipedia articles, and other Internet crawls from forums and blog posts. There is no company internal knowledge encoded in foundational models.
Thus, when we evaluate the capabilities of a foundational model in isolation, we can only evaluate the general capabilities of how queries are answered. However, the extensiveness of company-specific knowledge bases that show “how much the model knows” cannot be judged. Company-specific knowledge only becomes available to a foundational model through advanced orchestration that inserts company-specific context.

For example, with a free account from ChatGPT, anyone can ask, “How did Goethe die?” The model will provide an answer because the key information about Goethe’s life and death is in the model’s knowledge base. Yet, the question “How much revenue did our company make last year in Q3 in EMEA?” will most likely lead to a heavily hallucinated answer which will seem plausible to inexperienced users. However, we can still evaluate the form and representation of the answers, including style and tone, as well as language capabilities and skills concerning reasoning and logical deduction. Synthetic benchmarks such as ARC, HellaSwag, and MMLU provide comparative metrics for those dimensions. We will take a deeper look into those benchmarks in a later section.

Fine-tuned models build on foundational models. They use additional data sets to add knowledge that was not in the model before by further training the underlying machine learning model. Fine-tuned models have more context-specific knowledge. If we orchestrate them in isolation without any other ingested data, we can evaluate the knowledge base concerning its suitability for real-world scenarios in a given business process. Fine-tuning is often used to add domain-specific vocabulary and sentence structures to a foundational model.

Suppose we train a model on a corpus of legal court rulings. In that case, a fine-tuned model will start using the vocabulary and reproducing the sentence structure that is common in the legal domain.
The model can combine some excerpts from old cases but fails to quote the right sources.

Orchestrating foundational models or fine-tuned models with retrieval augmented generation (RAG) produces highly context-dependent results. However, this also requires a more complex orchestration pipeline.

For example, a telco company, like in our example above, can use a language model to create embeddings of their customer support knowledge base and store them in a vector store. We can now efficiently query this knowledge base in a vector store with semantic search. By keeping track of the text segments that are retrieved, we can very precisely show the source of the retrieved text chunk and use it as context in a call to a large language model. This lets us answer our question end-to-end.

We can evaluate how well our application serves its intended purpose end-to-end for such large orchestrations with different data processing pipeline steps.

Evaluating those different types of setups gives us different insights that we can use in the development process of generative AI applications. We will look deeper into this aspect in the next section.

2.2. Evaluation of Generative AI Applications in the Development Lifecycle

We develop generative AI applications in different stages: 1) before building, 2) during build and testing, and 3) in production. With an agile approach, these stages are not executed in a linear sequence but iteratively. Yet, the goals and methods of evaluation in the different stages remain the same regardless of their order.

Before building, we need to evaluate which foundational model to choose or whether to create a new one from scratch. Therefore, we must first define our expectations and requirements, especially with respect to execution time, efficiency, price and quality. Currently, only very few companies decide to build their own foundational models from scratch due to cost and updating efforts.
Fine-tuning and retrieval augmented generation are the standard tools to build highly personalized pipelines with traceable internal knowledge that leads to reproducible outputs. In this stage, synthetic benchmarks are the go-to approaches to achieve comparability. For example, if we want to build an application that helps lawyers prepare their cases, we need a model that is good at logical argumentation and understanding of a specific language.

During building, our evaluation needs to focus on satisfying the quality and performance requirements of the application’s example cases. In the case of building an application for lawyers, we need to make a representative selection of a limited number of old cases. Those cases are the basis for defining the standard scenarios against which we implement the application. For example, if the lawyer specializes in financial law and taxation, we would select a few of the standard cases for which this lawyer has to create scenarios. Every building and evaluation activity that we do in this phase has a limited view of representative scenarios and does not cover every instance. Yet, we need to evaluate the scenarios in the ongoing steps of application development.

In production, our evaluation approach focuses on quantitatively evaluating the real-world usage of our application with the expectations of live users. In production, we will find scenarios that are not covered in our building scenarios. The goal of the evaluation in this phase is to discover those scenarios and gather feedback from live users to improve the application further.

The production phase should always feed back into the development phase to improve the application iteratively. Hence, the three phases are not in a linear sequence, but interleaved.

2.3. Benchmark Metrics for Evaluation

With the “what” and “when” of the evaluation covered, we have to ask “how” we are going to evaluate our generative AI applications.
Therefore, we have three different methods: synthetic benchmarks, limited scenarios and feedback loop evaluation in production.

For synthetic benchmarks, we will look into the most commonly used approaches and compare them.

The AI2 Reasoning Challenge (ARC) tests an LLM’s knowledge and reasoning using a dataset of 7787 multiple-choice science questions. These questions range from 3rd to 9th grade and are divided into Easy and Challenge sets. ARC is useful for evaluating diverse knowledge types and pushing models to integrate information from multiple sentences. Its main benefit is comprehensive reasoning assessment, but it’s limited to scientific questions.

HellaSwag tests commonsense reasoning and natural language inference through sentence completion exercises based on real-world scenarios. Each exercise includes a video caption context and four possible endings. This benchmark measures an LLM’s understanding of everyday scenarios. Its main benefit is the complexity added by adversarial filtering, but it primarily focuses on general knowledge, limiting specialized domain testing.

The MMLU (Massive Multitask Language Understanding) benchmark measures an LLM’s natural language understanding across 57 tasks covering various subjects, from STEM to humanities. It includes 15,908 questions from elementary to advanced levels. MMLU is ideal for comprehensive knowledge assessment. Its broad coverage helps identify deficiencies, but limited construction details and errors may affect reliability.

TruthfulQA evaluates an LLM’s ability to generate truthful answers, addressing hallucinations in language models. It measures how accurately an LLM can respond, especially when training data is insufficient or low quality. This benchmark is useful for assessing accuracy and truthfulness, with the main benefit of focusing on factually correct answers.
However, its general knowledge dataset may not reflect truthfulness in specialized domains.

The RAGAS framework is designed to evaluate Retrieval Augmented Generation (RAG) pipelines. It is especially useful for the category of LLM applications that use external data to enhance the LLM’s context. The framework introduces metrics for faithfulness, answer relevancy, context recall, context precision, context relevancy, context entity recall and summarization score, which allow a differentiated assessment of the quality of the retrieved outputs.

WinoGrande tests an LLM’s commonsense reasoning through pronoun resolution problems based on the Winograd Schema Challenge. It presents near-identical sentences with different answers based on a trigger word. This benchmark is beneficial for resolving ambiguities in pronoun references, featuring a large dataset and reduced bias. However, annotation artifacts remain a limitation.

The GSM8K benchmark measures an LLM’s multi-step mathematical reasoning using around 8,500 grade-school-level math problems. Each problem requires multiple steps involving basic arithmetic operations. This benchmark highlights weaknesses in mathematical reasoning, featuring diverse problem framing. However, the simplicity of problems may limit their long-term relevance.

SuperGLUE enhances the GLUE benchmark by testing an LLM’s NLU capabilities across eight diverse subtasks, including Boolean Questions and the Winograd Schema Challenge. It provides a thorough assessment of linguistic and commonsense knowledge. SuperGLUE is ideal for broad NLU evaluation, with comprehensive tasks offering detailed insights. However, fewer models are tested compared to benchmarks similar to MMLU.

HumanEval measures an LLM’s ability to generate functionally correct code through coding challenges and unit tests. It includes 164 coding problems with several unit tests per problem.
This benchmark assesses coding and problem-solving capabilities, focusing on functional correctness similar to human evaluation. However, it only covers some practical coding tasks, limiting its comprehensiveness.

MT-Bench evaluates an LLM’s capability in multi-turn dialogues by simulating real-life conversational scenarios. It measures how effectively chatbots engage in conversations, following a natural dialogue flow. With a carefully curated dataset, MT-Bench is useful for assessing conversational abilities. However, its small dataset and the difficulty of simulating real conversations remain limitations.

All those metrics are synthetic and aim to provide a relative comparison between different LLMs. However, their concrete impact for a use case in a company depends on how well the challenge in the scenario maps to the benchmark. For example, in tax accounting use cases where a lot of math is needed, GSM8K would be a good candidate to evaluate that capability. HumanEval is the initial tool of choice for the use of an LLM in a coding-related scenario.

2.4. Real-life Scenario-based Evaluation

However, the impact of those benchmarks is rather abstract and only gives an indication of their performance in an enterprise use case. This is where working with real-life scenarios is needed.

Real-life scenarios consist of the following components:

  • case-specific context data (input),
  • case-independent context data,
  • a sequence of tasks to complete and
  • the expected output.

With real-life test scenarios, we can model different situations, like

  • multi-step chat interactions with several questions and answers,
  • complex automation tasks with multiple AI interactions,
  • processes that involve RAG and
  • multi-modal process interactions.

In other words, it does not help anyone to have the best model in the world if the RAG pipeline always returns mediocre results because your chunking strategy is not good.
Also, if you do not have the right data to answer your queries, you will always get some hallucinations that may or may not be close to the truth. In the same way, your results will vary based on the hyperparameters of your chosen models (temperature, frequency penalty, etc.). And we cannot use the most powerful model for every use case if it is an expensive model.

Standard benchmarks focus on the individual models rather than on the big picture. That is why we introduce the PEEL framework for performance evaluation of enterprise LLM applications, which gives us an end-to-end view.

The core concept of PEEL is the evaluation scenario. We distinguish between an evaluation scenario definition and an evaluation scenario execution. The conceptual illustration shows the overall concepts in black, an example definition in blue and the outcome of one instance of an execution in green.

The concept of evaluation scenarios as introduced by the PEEL framework © Marcel Müller

An evaluation scenario definition consists of input definitions, an orchestration definition and an expected output definition.

For the input, we distinguish between case-specific and case-independent context data. Case-specific context data changes from case to case. For example, in the customer support use case, the question that a customer asks differs from one customer case to the next. In our example evaluation execution, we depicted one case where the email inquiry reads as follows:

“Dear customer support,
my name is . How do I reset my router when I move to a different apartment?
Kind regards,  ”

Yet, the knowledge base where the answers to the question are located in large documents is case-independent. In our example, we have a knowledge base with the PDF manuals for the routers AR83, AR93, AR94 and BD77 stored in a vector store.

An evaluation scenario definition has an orchestration.
An orchestration consists of a series of n >= 1 steps that are executed in sequence during the evaluation scenario execution. Each step has inputs that it takes from any of the previous steps or from the input to the scenario execution. Steps can be interactions with LLMs (or other models), context retrieval tasks (for example, from a vector DB) or other calls to data sources. For each step, we distinguish between the prompt / request and the execution parameters. The execution parameters include the model or method that needs to be executed and hyperparameters. The prompt / request is a collection of different static or dynamic data pieces that get concatenated (see illustration).

In our example, we have a three-step orchestration. In step 1, we extract a single question from the case-specific input context (the customer’s email inquiry). We use this question in step 2 to create a semantic search query in our vector database using the cosine similarity metric. The last step takes the search results and formulates an email using an LLM.

In an evaluation scenario definition, we have an expected output and an evaluation method. Here, we define for every scenario how we want to evaluate the actual outcome vs. the expected outcome. We have the following options:

  • Exact match/regex match: We check for the occurrence of a specific series of terms/concepts and return a boolean, where 0 means that the defined terms did not appear in the output of the execution and 1 means they did appear. For example, the core concept of installing a router at a new location is pressing the reset button for 3 seconds. If the terms “reset button” and “3 seconds” are not part of the answer, we would evaluate it as a failure.
  • Semantic match: We check if the text is semantically close to what our expected answer is.
    Therefore, we use an LLM and task it to judge with a rational number between 0 and 1 how well the answer matches the expected answer.
  • Manual match: Humans evaluate the output on a scale between 0 and 1.

An evaluation scenario should be executed many times because LLMs are non-deterministic models. We want to have a reasonable number of executions so we can aggregate the scores and have a statistically significant output.

The benefit of using such scenarios is that we can use them while building and debugging our orchestrations. When we see that 80 out of 100 executions of the same prompt score less than 0.3, we use this insight to tweak our prompts or to add other data to our fine-tuning before orchestration.

2.5. Feedback Collection and Adjustment in Production

The principle for collecting feedback in production is analogous to the scenario approach. We map each user interaction to a scenario. If the user has larger degrees of freedom of interaction, we might need to create new scenarios that we did not anticipate during the building phase.

The user gets a slider between 0 and 1, where they can indicate how satisfied they were with the output of a result. From a user experience perspective, this number can also be simplified into different media, for example, a laughing, neutral and sad smiley. Thus, this evaluation is the manual match method that we introduced before.

In production, we have to create the same aggregations and metrics as before, just with live users and a potentially larger amount of data.

3. Example Implementation as Part of entAIngine Test Bed

Together with the entAIngine team, we have implemented the functionality on the platform. This section is to show you how things could be done and to give you inspiration. Or if you want to use what we have implemented, feel free to.

We map our concepts for evaluation scenarios and evaluation scenario definitions to classic concepts of software testing.
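
One way to read this mapping, as a minimal sketch rather than the entAIngine implementation: an evaluation scenario definition plays the role of a test case, one scenario execution is a test run, and the aggregated scores feed the final verdict. All names below (`ScenarioDefinition`, `regex_match`, `run_scenario`, `summarize`, `flaky_orchestration`) are hypothetical, and the orchestration is a seeded random stub that stands in for a non-deterministic LLM pipeline. The evaluator is the exact/regex match method, and the 0.3 threshold is taken from the example in section 2.4.

```python
import random
import re
import statistics
from dataclasses import dataclass
from typing import Callable

@dataclass
class ScenarioDefinition:
    case_input: str                # case-specific context, e.g. the customer email
    required_patterns: list[str]   # expected output, expressed as regex patterns
    orchestration: Callable[[str, random.Random], str]  # pipeline under test

def regex_match(output: str, patterns: list[str]) -> float:
    """Exact/regex match: 1.0 if every pattern occurs in the output, else 0.0."""
    return float(all(re.search(p, output, re.IGNORECASE) for p in patterns))

def run_scenario(scenario: ScenarioDefinition, runs: int, seed: int = 0) -> list[float]:
    """Execute the scenario several times and score every single execution."""
    rng = random.Random(seed)
    return [
        regex_match(scenario.orchestration(scenario.case_input, rng),
                    scenario.required_patterns)
        for _ in range(runs)
    ]

def summarize(scores: list[float], threshold: float = 0.3) -> dict[str, float]:
    """Aggregate scores: mean and share of executions below the threshold."""
    below = sum(s < threshold for s in scores) / len(scores)
    return {"mean": statistics.mean(scores), "share_below_threshold": below}

def flaky_orchestration(email: str, rng: random.Random) -> str:
    """Stub for a non-deterministic pipeline: sometimes omits the key detail."""
    if rng.random() < 0.8:
        return "Press the reset button for 3 seconds."
    return "Please restart the router."

scenario = ScenarioDefinition(
    case_input="How do I reset my router when I move to a different apartment?",
    required_patterns=[r"reset button", r"3 seconds"],
    orchestration=flaky_orchestration,
)
scores = run_scenario(scenario, runs=100)
summary = summarize(scores)
print(summary)
```

A semantic match or manual match evaluator would slot into the same place as `regex_match`, returning a score between 0 and 1 instead of a strict 0 or 1, so the aggregation step stays unchanged.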
The start point for any interaction to create a new test is via the entAIngine application dashboard.

entAIngine dashboard © Marcel Müller

In entAIngine, users can create many different applications. Each of the applications is a set of processes that define workflows in a no-code interface. Processes consist of input templates (variables), RAG components, calls to LLMs, TTS, image and audio modules, integration with documents and OCR. With these components, we build reusable processes that can be integrated via an API, used as chat flows, used in a text editor as a dynamic text-generating block, or in a knowledge management search interface that shows the sources of answers. This functionality is, at the moment, already completely implemented in the entAIngine platform and can be used as SaaS or deployed 100% on-premise. It integrates with existing gateways, data sources and models via API. We will use the process template generator for evaluation scenario definitions.

When the user wants to create a new test, they go to “test bed” and “tests”.

On the tests screen, the user can create new evaluation scenarios or edit existing ones. When creating a new evaluation scenario, the orchestration (an entAIngine process template) and a set of metrics must be defined. We assume we have a customer support scenario where we need to retrieve data with RAG to answer a question in the first step and then formulate an answer email in the second step. Then, we use the new module to name the test, define / select a process template and pick an evaluator that will create a score for every individual test case.

Test definition © Marcel Müller, 2025

Test case (process template) definition © Marcel Müller, 2025

The metrics are as defined above: regex match, semantic match and manual match. The screen with the process definition already exists and is functional, together with the orchestration.
The functionality to define tests in bulk as seen below is new.

Test and test cases © Marcel Müller, 2025

In the test editor, we work on an evaluation scenario definition (“evaluate how good our customer support answering RAG is”) and we define different test cases in this scenario. A test case assigns data values to the variables in the test. We can try 50 or 100 different instances of test cases and evaluate and aggregate them. For example, if we evaluate our customer support answering, we can define 100 different customer support requests, define our expected outcome and then execute them and analyze how good the answers were. Once we have designed a set of test cases, we can execute their scenarios with the right variables using the existing orchestration engine and evaluate them.

Metrics and evaluation © Marcel Müller, 2025

This testing happens during the building phase. We have an additional screen that we use to evaluate real user feedback in the production phase. The contents are collected from real user feedback (through our engine and API).

The metrics that we have available in the live feedback section are collected from a user through a star rating.

Conclusion: Testing and Quality

In this article, we have looked into advanced testing and quality engineering concepts for generative AI applications, especially those that are more complex than simple chatbots. The introduced PEEL framework is a new approach for scenario-based testing that is closer to the implementation level than the generic benchmarks with which we test models. For good applications, it is important to test the model not only in isolation, but also in orchestration.

Get in touch with me

I work day to day on real-world applications with generative AI, especially in the enterprise.
If you want to connect, feel free to add me or send a message on LinkedIn.

Why Generative-AI Apps’ Quality Often Sucks and What to Do About It was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
