RAG
RAG
- Large Language Models can only answer questions or generate text based on their training data. i.e. if we trained a model in 2018 and ask about the covid, it wouldn’t be able to answer our question.
- One way to solve this problem will be to fine-tune the model on the latest data. But there is another technique call RAG which is very useful for question answering.
- You may have seen the following output while using chatgpt (i was trained on 2013…)
- What is a language model
A language model is a probabilistic model that assign probabilities to sequence of words. In practice, a language model allows us to compute the following: P[”China” | ”Shanghai is a city in”] prompt가 주어졌을때 다음 token이 나올 확률 계산. |
How do we train inference a Language Model?
Language model은 corpus of text으로 train. corpus of text = large collection of documents. Often language models are trained on the entire Wikipedia and millions of web pages. This allows the language model to acquire as much knowledge as possible. 주로 사용되는 architecture은 Transformer based neural network as language model. For example LLAMA is a decoder only network and BERT is a encoder only network. LLaMA This means that the input data is fed directly into the decoder without being transformed into a higher, more abstract representation by an encoder.
Advantages of Decoder-Only Architectures:
- Autoregressive Generation:
- Decoder-only architectures are well-suited for autoregressive generation tasks, where the model generates one token at a time based on the previously generated tokens. This is common in language modeling and text generation tasks.
- Conditional Generation:
- Decoders are effective for conditional generation tasks, where the generation of each token depends on both the input context and the previously generated tokens. This is useful in tasks like machine translation and text summarization.
- Variable-Length Outputs:
- Decoder-only models can handle variable-length outputs, adapting to the length of the generated sequence dynamically.
- Masked Self-Attention:
- In the original Transformer architecture, the decoder uses masked self-attention to attend only to positions preceding the current position during generation. This allows the model to attend to the left context while generating each token.
Advantages of Encoder-Only Architectures:
- Representation Learning:
- Encoder-only architectures are effective for learning representations of input sequences. The encoder processes the input sequence and produces fixed-size representations, which can be used for various downstream tasks without autoregressive generation.
- Fixed-Size Contexts:
- The encoder produces fixed-size representations regardless of the input sequence length, making it suitable for tasks that require fixed-size input representations.
- Efficiency:
- For tasks where only contextualized representations are needed (e.g., text classification), using the encoder only can be more computationally efficient than autoregressive generation.
- Pre-training:
- Pre-training encoder-only models on large datasets can lead to effective feature extraction, and the learned representations can be fine-tuned for specific downstream tasks.
Use Cases:
- Decoder-Only:
- Text generation tasks (e.g., language modeling, text completion).
- Conditional text generation tasks (e.g., machine translation, summarization).
- Tasks requiring autoregressive sampling.
- Encoder-Only:
- Text classification tasks.
- Feature extraction for downstream tasks.
- Efficient processing of fixed-size input representations.
The choice between decoder-only, encoder-only, or a combination depends on the specific task requirements, training objectives, and the nature of the input and output data. Many modern models use both encoder and decoder components in various configurations to leverage their respective advantages for different aspects of the task.
You are what you eat
A language model can only output text and information that it was trained upon. 영어로 된 정보만 학습시키면, 다른 언어로 output을 제공하는것은 어려움. 새로운 concept를 학습시키려면 fine-tuning 과정은 거쳐야한다.
Cons of fine-tuning
Very expensive (in terms of computational power). Number of parameters of the model may not be sufficient to capture all the knowledge we want to teach to it. 예를 들어 LLaMA의 경우도 7B, 13B, 70B 로 소개됨.
Fine-tuning is not additive. 기존의 지식을 새로운 지식으로 대체할 수 있다. 예를 들어 영어로 train이 된 언어모델의 경우, 일본어로 heavy fine-tuning을 하게 되면, may not be proficient in English or in Japanese.
Prompt engineering to the rescue!
it is possible to “teach” a language model how to perform a new task by playing with the prompt. For example, by using “few-shot” prompting.V비
By providing instruction and example for a new task, LLM can perform a task it was not taught before.
챗GPT가 학습됬을당시에 몰랐던 정보를, promt에 제공함으로써 언어모델을 학습시킬수있다.
The pros of fine-tuning
Has higher quality results compared to prompt engineering.
Smaller context size (input size) during inference since we don’t need to include the context and instructions.
RAG pipeline
QA with Retrieval Augmented Generation
RAG is not fine-tuning, instead we are prompt-engineering but we are introducing a VEctor DB.
Embedding Vectors
Why do we use vectors to represent words?
Given the words “cherry”, “digital” and “information”, if we represent the embedding vectors using only 2 dimensions (X,Y) [usually there are more dimensions i.e. vanilla transformer 512] and we plot them, we hope to see something like this
- words with similar meaning or words that represent the same concept 같은 방향을 가르키는것을 원한다. 비슷한 뜻을 가지고 있는 단어끼리 angle difference가 작고, 반대로 다른 의미를 가지고 있는 단어끼리는 그 격차가 커지는 것을 볼 수 있다.
For example tomato will be pointing at the similar direction as cherry.
How do we measure this angle? Cosine similarity, which is based on the dot product between the two products.
Word embeddings: the ideas
Words that are synonym tend to occur in the same context (surrounded by the same words).
- For example, the word “teacher” and “professor” usually occur surrounded by the words “school”, “university”, “exam”, “lecture”, “course”, etc.
- The inverse can also be true: words that occur in the same context tend to have similar meanings. This is known as the distributional hypothesis.
- This means that to capture the meaning of the word, we also need to have access to its context ( the words surrounding it).
- This is why we employ the Self-Attention mechanism in the transformer model to capture contextual information for every token. The self-attention mechanism relates every token to all the other tokens in the same sentence, based on the position that token occupies in the sentence. Self-attention access two things to calculate its score of attention.
- Embedding of word which captures its meaning.
- Positional encoding so that words that are closer to each other are related differently from words that are far from each other.
Word embeddings: the Cloze task
masked language model task. To create the embeddings of BERT. The embeddings is based on the Cloze task. This is how we train BERT.
We want the Self-Attention mechanism to relate all the input tokens with each other so that BERT has enough information about the context of the missing word to predict it
How do we train embedding vectors in BERT?
Imagine we want to train BERT on the masked language model task and we create an input with the masked token. This becomes the input of BERT with 14 tokens. Since BERT is a transformer model it will output its seq2seq model if input 14 tokens output will also be 14 tokens. We ask BERT to predict the 4th token. We calculate the LOSS based on the target word token CAPITAL, we run backpropagation to update the weights. when we run backpropagation the model will also update the “input embedding” so that the embedding associated with the word Capital will be modified in such a way that it captures all the information about it context so that next time BERT will have less trouble predicting it given its context. This is the reason why we run backpropagation because we want the model to perform better and better by reducing the loss.
Sentence Embedding
However, we can also use Self-Attention mechanism to capture the meaning of an entire sentence. We can use a pre-trained BERT model to produce embeddings of entire sentences.
Notice that we removed the linear layer from before, and the input of the Self-Attention: matrix of shape (13,768) why because there are 13 tokens and each token is represented by an embedding with 768 dimensions which is the size of the embedding Vector in BERT base.
It will output matrix with the same shape that captures not only the meaning of the word or its position in the sentence, but also all the contextual information all the relationship between other words
Each token captures the meaning of the single word. We do Mean-Pooling: take the average of all the vectors. Which will result in a single vector with 768 dimensions, which captures the meaning of the entire sentence.
Sentence Embeddings: comparison
How can we compare Sentence Embeddings to see if two sentences have similar “meaning”? We could use the cosine similarity, which measures the cosine of the angle between the two vectors. A small angle results in a high cosine similarity score.
How can we compare Sentence Embeddings to see if two sentences have similar “meaning”? We could use the cosine similarity, which measures the cosine of the angle between the two vectors. A small angle results in a high cosine similarity score.
So if the angle difference is close to 0, cosine similarity will close 1. Values ranges from negative 1 to positive 1. identical = 1, opposite = -1, independent = 0.
But there is a problem, nobody told BERT that the embeddings it produces should be comparable with the cosine similarity, that is, two similar sentences should be represented by vectors pointing to the same direction in space. How can we teach BERT to produce embeddings that can be compared with a similarity function of our choice?
Sentence BERT
Siamese Network has two or more branches with the same architecture, parameters and weights.
When we code it, we only use one branch first entering sentence A then sentence B, calculate cosine similarity to adjust the weights on only one branch. But when we represent it we show it as two or more branches. We set the target of similar cosine similarity to be close to one, different cosine of 0. As we saw earlier BERT produces vectors with 768 dimensions, we can apply a linear layer to reduce the size of the embedding vector, for example from 768 to 512.
For sentence embedding, sentence BERT uses MEAN POOLING, there are other methods such as MAX pooling and CLS method, but the paper suggests that they do not perform better than the MEAN POOLING>
QA with retrieval augmented Generation
pipeline of Retrieval Augmented Generation.
- First we have our query.
- Then we have our documents where we can find our answers
- We split them into single sentences.
- We embed this sentences using Sentence BERT.
- Sentence BERT will produce vectors of fixed size 768
- We store the vector on Vector DB
- Query is also converted into a vector of size 768 dimensions
- We search this vector in Vector DB,
- HOW? we compare the input vector within the vectors in Vector DB that have a high cosine similarity value.
- This will produce the TOP-K embeddings that best match with our query
- We map the TOP-K embeddings from where they were originally from to create CONTEXT.
- We create our prompt template with the context and query.
- We feed it to the LLM which will produce an answer
Vector DB
A vector database stores vectors of fixed dimensions (called embeddings) such that we can then query the data to find all the embeddings that are closest (most similar) to a given query vector using a distance metric, which is usually the cosine similarity. The database uses a variant of the KNN (K Nearest Neighbor) algorithm or another similarity search algorithm. Vector DBs can also be used for finding similar songs, images or products.
KNN: a naive approach
KNN is an algorithm that allows us to compare a particular query with all the vectors that we have stored in our database, sort them by distance or similarity depending on which one we use. And then we keep the top best matching.
We compare each vector from the Vector DB by comparing the distance of the cosine similarity. We sort them by cosine similarity from highest to lowest and we choose top-k depending on how we choose k.
The computational complexity is in the order of O(N*D) which is too slow.
Similarity Search: trading precision for speed.
The previous method which is very slow, it was very accurate since it compares the query with each of the vectors (doing all the possible comparisons). To increase the speed we need to lower the number of comparisons.
Hierarchical Navigable Small Worlds (HNSW)
Is the algorithm that is used in Vector Databases. Used by