Monday, April 6, 2026

Optimizing AI Models for Production Environments

 



We usually use LLMs in one of three ways:

1. Encode text into semantic vectors with little or no fine-tuning.
2. Fine-tune a pre-trained LLM to perform a very specific task using transfer learning.
3. Query an LLM to solve a task that it was pre-trained on or can intuit.
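
To make the first usage pattern concrete, here is a minimal sketch of retrieval over semantic vectors using cosine similarity. The three-dimensional vectors and document names are made up for illustration; a real embedding model would produce the vectors and they would have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: stand-ins for the output of an encoder model.
DOCS = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "gpu pricing": [0.0, 0.2, 0.9],
}

def retrieve(query_vector, docs=DOCS):
    """Return the name of the document whose vector is closest to the query."""
    return max(docs, key=lambda name: cosine_similarity(query_vector, docs[name]))

# A query vector that points mostly along the first dimension.
print(retrieve([0.8, 0.2, 0.1]))
```

The same nearest-neighbour idea applies regardless of which encoder produces the vectors; only the dimensionality and the quality of the match change.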
There are currently two types of LLMs:
1) Autoencoding LLMs: Learn an entire sequence by predicting masked tokens (words) given both past and future context. They are best for classification and embedding + retrieval tasks. [Example: BERT]
2) Autoregressive LLMs: Predict a future (next) token given only the preceding tokens. [Example: GPT]
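
One way to picture the difference between the two families is the attention mask each one typically uses. The sketch below is only an illustration (the function name `attention_mask` is made up here): an autoencoding model lets every position see every other position, while an autoregressive model lets a position see only itself and earlier positions.

```python
def attention_mask(seq_len, causal):
    """Build a seq_len x seq_len mask.

    mask[i][j] == 1 means position i may attend to position j.
    causal=False: bidirectional (past and future context visible).
    causal=True:  only positions j <= i are visible (past context only).
    """
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

print("autoencoding (bidirectional):")
for row in attention_mask(4, causal=False):
    print(row)

print("autoregressive (causal):")
for row in attention_mask(4, causal=True):
    print(row)
```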
LLMs excel at tasks that require reasoning over context and input information in conjunction to produce a nuanced answer.


AI agents are semi-autonomous systems that interact with their environment, make decisions, and perform tasks on behalf of users.
Autonomy - They can perform tasks without continuous human intervention.
Decision Making - They use data to analyze options and choose actions.
Adaptability - They learn and improve over time with feedback.
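
The three properties above can be sketched with a toy example. The `ThermostatAgent` below is entirely hypothetical and has nothing to do with LLMs; it only shows the shape of an observe, decide, act, learn loop.

```python
class ThermostatAgent:
    """Hypothetical agent that keeps a room near a target temperature."""

    def __init__(self, target, step=1.0):
        self.target = target
        self.step = step  # how many degrees one action changes the room

    def decide(self, temperature):
        # Decision making: choose an action from the observed data.
        if temperature < self.target - 0.5:
            return "heat"
        if temperature > self.target + 0.5:
            return "cool"
        return "idle"

    def learn(self, overshoot):
        # Adaptability: act more gently after overshooting the target.
        if overshoot:
            self.step = max(0.25, self.step / 2)

def run(agent, temperature, ticks):
    """Autonomy: the loop runs without a human choosing each action."""
    history = []
    for _ in range(ticks):
        action = agent.decide(temperature)
        before = temperature
        if action == "heat":
            temperature += agent.step
        elif action == "cool":
            temperature -= agent.step
        # Feedback signal: did this action jump across the target?
        overshoot = (before < agent.target < temperature) or (
            temperature < agent.target < before
        )
        agent.learn(overshoot)
        history.append((action, temperature))
    return history

agent = ThermostatAgent(target=21.0, step=2.0)
for action, temp in run(agent, 18.0, 4):
    print(action, temp)
```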
Optimizing Models:
Speculative Decoding: Using an assistant model to guide next-token prediction.
Caching open-source models: Implementing prompt caching with open-source models.
Quantization: Reducing the computational requirements of a neural network, typically by storing weights at lower numeric precision.
Distillation: Transferring knowledge from a large model into a smaller one through targeted fine-tuning.
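
As a rough illustration of quantization, the sketch below maps floating-point weights to 8-bit integers with a single shared scale and then maps them back. This is one simple scheme (symmetric, per-tensor), assumed here for illustration; production libraries offer several others, and the function names are made up.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> ints in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Approximate reconstruction of the original floats."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print("int8 values:", q)
print("scale:", scale)
# Each restored weight differs from the original by at most half a step.
print("max error:", max(abs(w - r) for w, r in zip(weights, restored)))
```

Each weight now needs one byte instead of four (for float32), at the cost of a small rounding error bounded by half the scale.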
Speculative Decoding:
The assistant model calls its forward method over and over again to draft several candidate tokens. The main model then simply verifies which of the drafted tokens it agrees with, keeping the matching ones and replacing the first one that does not match.
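
A minimal sketch of the greedy form of this idea follows. Both "models" are toy functions over integer tokens, invented purely for illustration: `draft_next` plays the cheap assistant and `target_next` plays the expensive main model. In a real system the verification loop would be a single batched forward pass of the main model, which is where the speed-up comes from.

```python
def draft_next(tokens):
    """Hypothetical cheap assistant model: always guesses last token + 1."""
    return tokens[-1] + 1

def target_next(tokens):
    """Hypothetical main model: counts up but wraps back to 0 after 4."""
    return (tokens[-1] + 1) % 5

def speculative_step(tokens, k=4):
    # 1. The assistant drafts k tokens, one forward call per token.
    draft, context = [], list(tokens)
    for _ in range(k):
        nxt = draft_next(context)
        draft.append(nxt)
        context.append(nxt)

    # 2. The main model checks each drafted token in order.
    accepted, context = [], list(tokens)
    for proposed in draft:
        expected = target_next(context)
        if proposed != expected:
            accepted.append(expected)  # main model's token replaces the miss
            break
        accepted.append(proposed)
        context.append(proposed)
    else:
        # Every draft was accepted, so the main model adds one more token.
        accepted.append(target_next(context))
    return tokens + accepted

def generate(prompt, max_new_tokens, k=4):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        tokens = speculative_step(tokens, k)
    return tokens[: len(prompt) + max_new_tokens]

print(speculative_step([1], k=4))
print(generate([0], max_new_tokens=7))
```

Because every kept token is one the main model would have chosen itself, the output matches plain greedy decoding with the main model; the assistant only changes how many main-model passes are needed.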




