A technical overview and discussion of the retrieval augmented generation (RAG) pattern for generative AI.
Earlier today at work I gave an interesting and, I think, pretty good description of RAG, the retrieval augmented generation pattern for generative AI. It's pretty popular nowadays, at both the enterprise and consumer level, and I explained why we need it, what alternatives exist, and why you would choose RAG over those alternatives. I also covered some of the mathematical background for the technical implementation of RAG. Since it was such a good discussion, I decided to polish up most of that explanation for my blog.
So generative AI is a pretty interesting tool for producing new text, usually for use cases like answering questions when prompted, summarizing a piece of provided text, or explaining the semantic or emotional charge behind, say, a tweet. Summarization, semantic extraction, and question answering via a virtual assistant are common use cases for generative AI, and they are possible because these generative AI models, large language models (LLMs), are very broadly trained on human language. Generally speaking, I will talk about their grasp of the English language, but there are models capable of translating and speaking other languages as well. A problem arises, though, when I say broadly trained on language: they don't have "domain specificity" in any particular topic beyond what was in the documents and text the models were trained on, which is generally what is available in the public domain online.
So a big problem, one that has been solved with a couple of different patterns, is how to give these LLMs domain specificity. For example, financial documents have a lot of nuance. Being able to answer questions, or give insight into very particular documentation or procedures of law or finance, where particular codes or statements are cited in a document somewhere, is very esoteric, but the specifics matter; that's a good instance of domain specificity. I'm working on a project right now with a university plant pathology department, where they're diagnosing diseases on plants (specifically potatoes, in this context), and the names and species of the bacteria, fungi, or other ailments afflicting these plants are very important. The names of pesticides and the specific numbers for their rates of application are important too. These are examples of domain specificity, which these broad models do not generally have out of the box.
So how does one give these models domain specificity? The answer stemming from traditional machine learning would be to retrain the models. Essentially, these LLMs are just very, very inflated, very expensive traditional machine learning models (sequence and text processing models, to be specific), but still machine learning models. You can retrain LLMs, and with generative AI that's called "fine tuning": you pass in new documents with domain-specific language or information and essentially retrain the parameters and weights of these models so that they are more attuned (pun intended) to that new dataset. That's one way to give these models domain specificity, but it is very, very expensive. Generally, it is on the same order as training the models initially; inferencing a model is one thing, but training one is generally much, much more expensive, so fine tuning is a relatively restrictive option for getting domain specificity.
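To make that a bit more concrete, here is a minimal sketch of what fine tuning can look like in practice, assuming the Hugging Face transformers and datasets libraries, a small stand-in model, and a hypothetical file of domain text called domain_docs.txt. A real fine-tuning job would target a far larger model with far more data, which is exactly where the expense comes from.

```python
# A minimal sketch of fine tuning a causal language model on domain text.
# Assumes Hugging Face transformers/datasets; "domain_docs.txt" is hypothetical.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "gpt2"  # small stand-in; real fine tuning targets much larger LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load the new domain-specific documents as plain text and tokenize them.
dataset = load_dataset("text", data_files={"train": "domain_docs.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # updates the model's weights to reflect the new domain text
```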
There are other ways, such as "prompt tuning", which I will not get into here; it's another way to get domain specificity without fine tuning, and I'll call it a medium solution. There's some computational expenditure needed to get prompt tuning to work: less than fine tuning, but more than the final option, which is retrieval augmented generation, or RAG.
RAG utilizes prompt engineering to provide these LLMs with domain-specific information at prompt time. Prompt engineering is how you get specific outputs out of these models, whether that's asking a question or providing a document and asking for a summary, and it has blossomed as a field of necessary skills for working with LLMs. Take that specific case where you provide a document and ask the model to summarize it: the model doesn't have knowledge of the document ahead of time in its training corpus. It's a new document, so by injecting the document into the prompt and asking a question pertaining to that document, the model now has the information available to produce an answer.
The same goes for asking a question the model was not trained on: if we have documentation or some other source of information outside of the model, we can draw upon it, inject that information in a more condensed form into the prompt, and then ask the model to pull a specific piece of insight out of it. That is a much more feasible and much less expensive way to get domain specificity from these LLMs: utilizing external information and giving it to the model at prompt time, as opposed to storing and reciting the information from the model parameters alone via fine tuning.
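As a quick illustration, here is a minimal sketch of what that injection might look like. The passage, question, and template wording are hypothetical examples of my own, not any standard format.

```python
# A minimal sketch of injecting external text into a prompt at prompt time.
# The retrieved passage, question, and template wording are hypothetical.
retrieved_passage = (
    "Late blight of potato is caused by Phytophthora infestans. "
    "Apply fungicide X at a rate of 1.5 pints per acre."  # example domain text
)
question = "What causes late blight of potato, and what is the application rate?"

prompt = f"""Answer the question using only the context below.

Context:
{retrieved_passage}

Question: {question}
Answer:"""

# `prompt` would then be sent to whatever LLM API you are using; the model
# answers from the injected context rather than from its training data alone.
print(prompt)
```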
To recap: retrieval augmented generation, or RAG, works by taking specifically relevant pieces from your set of documentation or a knowledge base and injecting those pieces of text into the prompt, so that the model can answer questions based on information from the prompt alone. But some questions arise when I say relevant pieces of information. First off, it is necessary to filter the information that goes into a prompt: if you have tens of documents with hundreds of pages to cite information from, LLMs can only process a limited amount of text at a time, so it is infeasible to provide an entire knowledge base to the LLM at prompt time for any particular question. We have to pull in preferably the most relevant information pertinent to the question, such as the table from the document that contains the answer, or the block of text that explains the answer to our question. So, technically speaking, how do we get this relevant information from our documents or from our knowledge base?
You may have heard of a technology that's cropped up alongside generative AI specifically for this purpose of retrieving relevant pieces of data, or accessing particular blocks of text from a large set or pool: vector databases. So with vectors, we think of a series or sequence of numbers representing coordinates or values along some dimensions in space, right? We exist in a 3-dimensional physical space, so we have GPS coordinates: longitude, latitude, and altitude to describe our 3-D position. Any point on earth can be described with a three-dimensional vector corresponding to your longitude, latitude, and altitude, and you can assess the "similarity" between vectors in the context of distances between points in physical space. A physical distance, Euclidean distance, right? The square root of the sum of the element-wise squared differences between the coordinates of two points. The distance between those points gives some semblance of similarity. For example, my coordinates and my neighbor's are separated by 200 to 300 feet, so they could be "more similar" than mine and those of my buddy who lives in Sweden, 2,500 miles apart, right?
This notion of similarity becomes clearer when the vector describes data that's not spatial, but instead describes attributes of something. If, instead of physical X, Y, and Z coordinates, we have a vector representing data about a person, say their height and weight, where X is their height and Y is their weight, the "Euclidean distance" calculated between two people who are both 5′ 9″ and weigh 170 and 180 pounds respectively will be very short, i.e., a very high similarity. Conversely, for a horse jockey who is 4′ 8″ and 130 lbs and LeBron James, who is 6′ 8″ and 270 lbs (I'm not exactly sure), their dimensions are much different, their distance is much greater, and so they are not very similar to each other.
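Here is a small sketch of that calculation in Python with NumPy, using made-up (height in inches, weight in pounds) vectors along the lines of the examples above.

```python
# A minimal sketch of Euclidean distance as a similarity measure,
# using hypothetical (height in inches, weight in lbs) vectors.
import numpy as np

person_a = np.array([69, 170])   # 5'9", 170 lbs
person_b = np.array([69, 180])   # 5'9", 180 lbs
jockey   = np.array([56, 130])   # 4'8", 130 lbs
lebron   = np.array([80, 270])   # 6'8", 270 lbs (roughly)

def euclidean(u, v):
    # Square root of the sum of the element-wise squared differences.
    return np.sqrt(np.sum((u - v) ** 2))

print(euclidean(person_a, person_b))  # small distance -> very "similar" people
print(euclidean(jockey, lebron))      # large distance -> not very similar
```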
Another, more nebulous similarity metric that's very commonly used in machine learning is "cosine similarity". Cosine similarity is based on the definition of the cosine with respect to vector algebra: the cosine of the angle between two vectors is equal to their normalized dot product, and the dot product is basically another element-wise comparison between two vectors. The exact intuition behind that is beyond this discussion, but knowing that you can compare vectors in space to get a semblance of similarity between them gives us a good foundation for finding similar or relevant text from a large pool of potential candidates.
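For completeness, here is a short sketch of cosine similarity with NumPy: the dot product divided by the product of the two vectors' lengths.

```python
# A minimal sketch of cosine similarity: the dot product normalized by vector lengths.
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # points in the same direction as a -> similarity of 1.0
c = np.array([-3.0, 0.0, 1.0])  # points somewhere else entirely -> much lower similarity

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```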
So the question then is: how does one convert text into a vector, with which you can calculate similarities against other text-representing vectors? The answer is "embedding models". Now, LLMs themselves also represent text internally as numbers and vectors: first a text sequence, like a question, is converted into numbers within the model, the model processes those, formulates the answer as numbers, and then translates them back into text, essentially. But if you take just the first half of that, taking text and converting it to numbers, you have a very powerful tool: the ability to map text to a numerical form with which you can calculate similarities. The particularly cool thing about these embedding models is that they've been trained to embed semantic meaning into a high-dimensional vector space. So if I have two sentences, "the cat runs over the field" and "the dog runs over the field", the vector similarity between them takes into account the semantic meaning behind the words, and these two sentences will have a high similarity, as opposed to "the cat runs over the field" and "astrophysics is a burgeoning field of science right now", which have very little in common semantically, so their similarity would be very low.
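Here's a brief sketch of that in practice, assuming the sentence-transformers library and its all-MiniLM-L6-v2 model as one example embedding model; any embedding model would illustrate the same idea.

```python
# A minimal sketch of embedding sentences and comparing their semantic similarity.
# Assumes the sentence-transformers package (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # one example embedding model

sentences = [
    "the cat runs over the field",
    "the dog runs over the field",
    "astrophysics is a burgeoning field of science right now",
]
embeddings = model.encode(sentences)  # one high-dimensional vector per sentence

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(embeddings[0], embeddings[1]))  # cat vs. dog: high similarity
print(cosine_similarity(embeddings[0], embeddings[2]))  # cat vs. astrophysics: low
```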
So, taking this all the way back to retrieval augmented generation and being able to pull relevant pieces of text to give to your model via the prompt: we have a large corpus or knowledge base of documents we want to pull from. We chunk these documents up into bite-sized pieces of text, embed those pieces into vectors, and store them in a vector database. We can then query the vector database with the embedding of a question from the user, calculate the semantic vector similarities between that new question and all of the vectors stored in the database, and pull, say, the top five most relevant passages across all of your documents that are pertinent to the question. Those passages are entered into the prompt, and the LLM can take them along with the question and provide an informed answer.
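Putting the pieces together, here is a compact sketch of that whole retrieval step, again assuming sentence-transformers for the embeddings and using a plain NumPy array as a stand-in for a vector database; the chunks, question, and prompt template are hypothetical.

```python
# A minimal end-to-end sketch of RAG retrieval: embed chunks, embed the question,
# rank chunks by cosine similarity, and build a prompt. A real system would use a
# vector database rather than an in-memory NumPy array.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical bite-sized chunks from a knowledge base.
chunks = [
    "Late blight of potato is caused by the pathogen Phytophthora infestans.",
    "Early blight symptoms include dark concentric rings on lower leaves.",
    "Rotate crops every three years to reduce soilborne disease pressure.",
    "Store seed potatoes in a cool, dark, well-ventilated space.",
]
chunk_vectors = model.encode(chunks)  # this array plays the role of the vector database

question = "What pathogen causes late blight in potatoes?"
question_vector = model.encode([question])[0]

# Cosine similarity between the question and every stored chunk.
norms = np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(question_vector)
similarities = chunk_vectors @ question_vector / norms

top_k = 2  # in practice this might be the top five
best = np.argsort(similarities)[::-1][:top_k]
context = "\n".join(chunks[i] for i in best)

prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"
print(prompt)  # this prompt is what gets sent to the LLM
```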
So, in summary: LLMs lack domain specificity. You can give them domain specificity by fine tuning them, changing their parameters and weights to incorporate new information, but that's expensive. You can also give LLMs domain specificity at prompt time by providing them relevant passages of information that contain parts of the answer, with which the model can piece together the correct answer in an informed way. To provide relevant pieces of documentation from a large pool of documents, we chunk those documents into small pieces, convert and represent those pieces of text as vectors in a high-dimensional space that contains semantic information, and then do similarity searches between those stored text vectors and the new question being asked, in order to pull the most relevant vectors from the database so they can be entered into the prompt and the large language model can answer the question in an informed way. That, in a nutshell, is how RAG works!
That covered a lot of the math, technical, and engineering details behind these LLMs and how they work, but it should be a good explanation of why RAG exists, the factors and design decisions behind it, and the interesting technologies that allow it to be such a popular pattern. Thank you for your time, and feel free to leave any comments or questions!
I also apologize for the hiatus; a lot has been going on in my life recently.