September 13, 2024

Document summarization is important for GenAI use-cases, but what if the documents are too BIG!? Read on to find out how I have solved it.

“Summarizing a lot of text”— Image generated with GPT-4o

Document summarization has become one of the most (if not the most) common problem statements to solve using modern Generative AI (GenAI) technology. Retrieval Augmented Generation (RAG) is a common yet effective solution architecture used to solve it. But what if the document itself is so large that it cannot be sent as a whole in a single API request? Or what if it produces so many chunks that we run into the infamous ‘Lost in the Middle’ context problem? In this article, I will discuss the challenges we face with such a problem statement and go through a step-by-step solution that I applied using the guidance offered by Greg Kamradt in his GitHub repository.

Some “context”

RAG is a well-discussed and widely implemented solution for document summarization using GenAI technologies. However, like any new technology or solution, it is prone to edge-case challenges, especially in today’s enterprise environment. Two main concerns are contextual length (coupled with per-prompt cost) and the previously mentioned ‘Lost in the Middle’ context problem. Let’s dive a bit deeper to understand these challenges.

Note: I will be performing the exercises in Python using the LangChain, Scikit-Learn, Numpy and Matplotlib libraries for quick iterations.

Context window and Cost constraints

Today, with automated workflows enabled by GenAI, analyzing big documents has become an industry expectation, if not a requirement. People want to quickly find relevant information in medical reports or financial audits just by prompting the LLM. But there is a caveat: enterprise documents are not like the documents or datasets we deal with in academia. They are considerably bigger, and the pertinent information can be present pretty much anywhere in them. Hence, methods like data cleaning and filtering are often not a viable option, since domain knowledge regarding these documents is not always available.

In addition to this, even the latest Large Language Models (LLMs) like GPT-4o by OpenAI, with context windows of 128K tokens, cannot consume these documents in one shot; and even if they could, the quality of the response would not meet standards, especially for the cost it would incur. To showcase this, let’s take a real-world example: summarizing the Employee Handbook of GitLab, which can be downloaded here. This document is available free of charge under the MIT license on their GitHub repository.

1. We start by loading the document and initializing our LLM; to keep this exercise relevant, I will use GPT-4o.

from langchain_community.document_loaders import PyPDFLoader

# Load PDFs
pdf_paths = ["/content/gitlab_handbook.pdf"]
documents = []

for path in pdf_paths:
    loader = PyPDFLoader(path)
    documents.extend(loader.load())

from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o")

2. Then we divide the document into smaller chunks (these are what we will embed; I will explain why in the later steps).

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

# Split documents into chunks
splits = text_splitter.split_documents(documents)
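
It can be useful to see how many chunks the splitter produced. The exact count depends on the PDF version you download and the splitter settings, so the number you get may differ; this is just a quick check:

print(f"Number of chunks created: {len(splits)}")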

3. Now, let’s calculate how many tokens make up this document. For this, we will iterate through each document chunk and add up its token count.

total_tokens = 0

for chunk in splits:
    text = chunk.page_content  # `page_content` is where the chunk's text is stored
    num_tokens = llm.get_num_tokens(text)  # Get the token count for each chunk
    total_tokens += num_tokens

print(f"Total number of tokens in the book: {total_tokens}")

# Total number of tokens in the book: 254006

As we can see, the document contains 254,006 tokens, while the context window limit for GPT-4o is 128,000 tokens, so this document cannot be sent through the LLM’s API in one go. In addition, considering this model’s pricing of $0.00500 per 1K input tokens, a single request covering this document would cost about $1.27! That may not sound terrible until you place it in an enterprise setting with multiple users and daily interactions across many such large documents, especially in a startup scenario where many GenAI solutions are being built.
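
As a quick sanity check on that figure, here is the back-of-the-envelope arithmetic in code, using the per-1K-token input price quoted above (output tokens, which are priced separately, are ignored here):

# Rough input cost per request, using the price quoted in this article
input_price_per_1k = 0.005  # USD per 1K input tokens
total_tokens = 254006       # token count computed in the previous step

cost_per_request = (total_tokens / 1000) * input_price_per_1k
print(f"Estimated input cost per request: ${cost_per_request:.2f}")
# Estimated input cost per request: $1.27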

Lost in the Middle

Another challenge faced by LLMs is the ‘Lost in the Middle’ context problem, discussed in detail in this paper. Research, along with my own experience with RAG systems handling multiple documents, shows that LLMs are not very robust at extracting information from long context inputs. Model performance degrades considerably when the relevant information sits somewhere in the middle of the context, and improves when it is at the beginning or the end of the provided context. Document re-ranking has become a subject of progressively heavy discussion and research to tackle this specific issue; I will explore a few of those methods in another post. For now, let us get back to the solution we are exploring, which utilizes K-Means Clustering.

What is K-Means Clustering?!

Okay, I admit I sneaked a technical concept into the last section; allow me to explain it (for those who may not be aware of the method, I’ve got you).

First the basics

To understand K-means clustering, we should first know what clustering is. Consider this: we have a messy desk with pens, pencils, and notes all scattered together. To clean up, one would group like items together: all pens in one group, pencils in another, and notes in another, creating essentially three separate groups (not promoting segregation). Clustering is the same process: among a collection of data (in our case, the different chunks of document text), similar data or information is grouped together, creating a clear separation of concerns for the model. This makes it easier for our RAG system to pick and choose information effectively and efficiently instead of having to go through it all like a greedy method.

K, Means?

K-means is a specific method to perform clustering (there are other methods but let’s not information dump). Let me explain how it works in 5 simple steps:

1. Picking the number of groups (K): how many groups we want the data to be divided into.
2. Selecting group centers: initially, a center value for each of the K groups is selected at random.
3. Group assignment: each data point is then assigned to a group based on how close it is to the previously chosen centers. Example: items closest to center 1 are assigned to group 1, items closest to center 2 are assigned to group 2, and so on up to the Kth group.
4. Adjusting the centers: after all the data points have been pigeonholed, we calculate the average position of the items in each group, and these averages become the new centers to improve accuracy (because we had initially selected them at random).
5. Rinse and repeat: with the new centers, the data point assignments are updated again for the K groups. This is repeated until the distance (mathematically, the Euclidean distance) is minimal for items within a group and maximal from the data points of other groups; ergo, optimal segregation. A minimal code sketch of this loop follows below.
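
To make the five steps concrete, here is a minimal, from-scratch sketch of the K-means loop in NumPy. It is only an illustration of the idea, not what Scikit-Learn does internally (which uses smarter initialization and proper convergence checks), and it assumes no cluster ever ends up empty:

import numpy as np

def simple_kmeans(points, k, iterations=10, seed=42):
    """Bare-bones K-means mirroring the five steps above (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Steps 1 & 2: choose K and pick K random data points as the initial centers
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Step 3: assign every point to its nearest center (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each center to the mean of the points assigned to it
        # (assumes no cluster goes empty, for simplicity)
        centers = np.array([points[labels == i].mean(axis=0) for i in range(k)])
        # Step 5: repeat; a fixed iteration count stands in for a convergence check
    return labels, centers

# Tiny usage example with two obvious groups of 2-D points
toy = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
labels, centers = simple_kmeans(toy, k=2)
print(labels)  # e.g. [0 0 1 1] (cluster numbering may differ)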

While this may be quite a simplified explanation, a more detailed and technical explanation (for my fellow nerds) of this algorithm can be found here.

Enough theory, let’s code.

Now that we have discussed K-means clustering which is the main protagonist in our journey to optimization, let us see how this robust algorithm can be used in practice to summarize our Handbook.

4. Now that we have our chunks of document text, we will embed them into vectors.

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# Embed the chunks
chunk_texts = [chunk.page_content for chunk in splits] # Extract the text from each chunk
chunk_embeddings = embeddings.embed_documents(chunk_texts)
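
Before moving on, it is worth a quick sanity check that every chunk produced an embedding, and a look at how long each vector is. The dimensionality depends on which embedding model sits behind OpenAIEmbeddings (commonly 1536 for OpenAI’s default embedding models), so treat the number you see as model-dependent rather than fixed:

# Sanity check: one embedding per chunk, each a fixed-length vector
print(f"Number of embedded chunks: {len(chunk_embeddings)}")
print(f"Dimensions per embedding: {len(chunk_embeddings[0])}")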

Maybe a little theory

Alright, alright so maybe there’s more to learn here — what’s embedding? Vectors?! and why?

Embedding & Vectors

Think of how a computer does things: it sees everything as binary, so the best language to instruct it in is numbers. Hence, an optimal way to have complex ML systems understand our data is to represent all that text as numbers, and the method by which we do this conversion is called embedding. The resulting list of numbers describing a piece of text or a word is known as a vector.

Embeddings can differ depending on how we want to describe our data and the heuristics we choose. Let’s say we want to describe an apple: we need to consider its color (red), its shape (roundness), and its size. Each of these can be encoded as a number; the ‘redness’ could be an 8 on a scale of 1–10, the roundness could be 9, and the size could be 3 (inches in width). Hence, our vector representing the apple would be [8, 9, 3]. The same concept is applied with more complexity when describing documents, where we want the numbers to capture the topic, the semantic relationships, and so on. This results in vectors that are hundreds of numbers long or more.
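
To make the apple example tangible, here is a toy sketch with hypothetical ‘fruit’ vectors and cosine similarity, one common way of measuring how close two vectors are (the fruit values below are made up purely for illustration):

import numpy as np

# Toy vectors following the [redness, roundness, size] idea above (made-up values)
apple = np.array([8, 9, 3])
cherry = np.array([9, 9, 1])   # also red and round, but smaller
banana = np.array([2, 3, 7])   # less red, less round, longer

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(apple, cherry))  # close to 1.0 -> very similar
print(cosine_similarity(apple, banana))  # noticeably lower -> less similar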

But, Why?!

Now, what improvements does this method provide? Firstly, as I mentioned before, it makes the data easier for models to interpret, which improves the accuracy of their inferences. Secondly, it helps massively with memory optimization (space complexity, in technical terms) by reducing the amount of memory the data consumes once converted into vectors. The space these vectors live in is known as a vector space. For example, a document chunk of 1,000 words can be reduced to a 768-dimensional vector representation, i.e. 768 numbers instead of 1,000 words.

A little deeper math (for my dear nerds again): “1234” in word form (a string, in computer language) consumes roughly 54 bytes of memory as a Python object, while 1234 in numeral form (an integer, in computer language) can be stored in as little as 8 bytes as a fixed-width number! So when you consider documents consuming megabytes, we are reducing memory management costs as well (yay, budget!).
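
You can check these rough numbers yourself. The exact byte counts depend on the Python version and interpreter, so take the figures below as illustrative rather than exact:

import sys
import numpy as np

print(sys.getsizeof("1234"))   # roughly 50-some bytes for the string object in CPython
print(sys.getsizeof(1234))     # a Python int object also carries overhead (~28 bytes)
print(np.int64(1234).nbytes)   # a raw fixed-width 64-bit integer payload: 8 bytes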

And we are back!

5. Using the Scikit-Learn Python library for easy implementation, we first select the number of clusters we want, in our case 15. We then run the algorithm to fit our embedded chunks into 15 clusters. The parameter random_state=42 fixes the random seed used to initialize the cluster centers, so the clustering is reproducible across runs.

It is also important to note that we convert our list of embeddings into a NumPy array (the numerical array type used for fast vector operations in the NumPy library), since that is the standard input format Scikit-Learn expects for K-means.

from sklearn.cluster import KMeans
import numpy as np

num_clusters = 15
# Convert the list of embeddings to a NumPy array
chunk_embeddings_array = np.array(chunk_embeddings)

# Perform K-means clustering
kmeans = KMeans(n_clusters=num_clusters, random_state=42).fit(chunk_embeddings_array)
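
At this point every chunk has been assigned a cluster label from 0 to 14. A quick look at how the chunks are distributed across clusters is a useful sanity check before we start using them (selecting representative chunks from each cluster is what Part 2 will cover):

from collections import Counter

# Each chunk now has a cluster label (0 to num_clusters - 1)
cluster_sizes = Counter(kmeans.labels_)
for cluster_id, size in sorted(cluster_sizes.items()):
    print(f"Cluster {cluster_id}: {size} chunks")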

Class dismissed…for now.

I think this is a good place for a pit stop! We have covered a lot, both in code and in theory. But no worries: I will be posting a second part covering how we make use of these clusters to generate rich summaries for large documents. There are more interesting techniques to showcase, and of course, I will explain all the theory as best as I can!

So stay tuned! Also, I would love your feedback and any comments you may have regarding this article, as it really helps me improve my content. As always, thank you so much for reading, and I hope it was worth the read!


