Optimizing AI Models for Production Environments

on April 06, 2026 in AI, LLM, OpenAI

We usually use LLMs in one of three ways:

1. Encode text into semantic vectors with little or no fine-tuning.

2. Fine-tune a pre-trained LLM to perform a very specific task using transfer learning.

3. Query an LLM to solve a task that it was pre-trained on or can intuit.
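To make the first usage concrete, here is a minimal sketch of retrieval over semantic vectors. The three-dimensional vectors and the document names are made up for illustration; a real encoder would produce vectors with hundreds of dimensions, and `retrieve` is a hypothetical helper, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings (assumption: hand-written, not produced by a model).
documents = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "gpu pricing": [0.0, 0.2, 0.9],
}

def retrieve(query_vector, docs):
    """Return the name of the document whose embedding is closest to the query."""
    return max(docs, key=lambda name: cosine_similarity(query_vector, docs[name]))

# Stand-in for the embedding of a question about getting money back.
query = [0.8, 0.2, 0.1]
best = retrieve(query, documents)
```

No fine-tuning is involved here: the encoder is used as-is, and all the task-specific work happens in the similarity search.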

There are two main types of LLMs:

1) Autoencoding LLMs - Learn an entire sequence by predicting tokens (words) given both past and future context. They are best for classification and embedding + retrieval tasks. [Example: BERT]

2) Autoregressive LLMs - Predict the next token given only the preceding tokens.
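One way to picture the difference is through the attention mask each type uses. This is a rough sketch that assumes a Transformer-style model, which the notes above do not spell out; the function names are hypothetical. A 1 means "position i may look at position j".

```python
def bidirectional_mask(n):
    """Autoencoding: every position may attend to every other position."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Autoregressive: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

tokens = ["the", "cat", "sat"]
auto_encoding = bidirectional_mask(len(tokens))
auto_regressive = causal_mask(len(tokens))

for name, mask in [("autoencoding", auto_encoding), ("autoregressive", auto_regressive)]:
    print(name)
    for row in mask:
        print(" ", row)
```

The lower-triangular mask is what stops an autoregressive model from seeing future tokens, which is why it can generate text one token at a time.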

LLMs excel at tasks that require reasoning over context and input information together to produce a nuanced answer.

AI agents are semi-autonomous systems that interact with their environment, make decisions, and perform tasks on behalf of users.

Autonomy - They can perform tasks without continuous human intervention.

Decision Making - Analyze data and choose actions.

Adaptability - Learn and improve over time with feedback.

Optimizing Models:

Speculative Decoding: Using an assistant model to guide next-token prediction.

Caching OS models: Implementing prompt caching with open-source models.

Quantization: Reducing the computation and memory requirements of a neural network by storing its weights at lower numeric precision.

Distillation: Transferring knowledge from a large model into a smaller one through targeted fine-tuning.
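Quantization and distillation both come down to a few lines of arithmetic, sketched below in plain Python. The notes do not say which schemes are meant, so this assumes symmetric per-tensor int8 quantization and a temperature-softened KL loss between teacher and student, one common formulation of distillation. All function names are hypothetical.

```python
import math

def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]

def softmax(logits, temperature=1.0):
    """Numerically stable softmax; higher temperature gives a flatter distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

weights = [0.5, -1.27, 0.0, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# The loss is zero when the student matches the teacher and positive otherwise.
matched = distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1])
mismatched = distillation_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0])
```

Quantization trades a small rounding error (at most half of `scale` per weight) for storing each weight in one byte. Distillation instead trains the small model so that its softened output distribution moves toward the large model's.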

Speculative Decoding:

The assistant model calls its forward method over and over again, proposing one candidate token per call. The main model then simply verifies which of the proposed tokens it agrees with.
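The propose-then-verify loop can be sketched with toy models. This is the greedy variant only (sampling-based versions use a probabilistic acceptance rule instead of an exact match), and `target_next`, `draft_next`, `speculative_step` and `generate` are hypothetical names standing in for real model calls.

```python
def speculative_step(target_next, draft_next, prefix, k=4):
    """One round of greedy speculative decoding; returns the newly accepted tokens."""
    # 1. The assistant (draft) model proposes k tokens, one forward call per token.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # 2. The main (target) model checks each proposed position. A real
    #    implementation scores all positions in one batched forward pass.
    accepted = []
    for token in proposal:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # first disagreement: keep the main model's token
            return accepted
        accepted.append(token)

    # 3. Every proposal was accepted, so the main model contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted

def generate(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Repeat speculative steps until max_new_tokens tokens have been produced."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        tokens += speculative_step(target_next, draft_next, tokens, k)
    return tokens[: len(prompt) + max_new_tokens]

# Toy models (assumption): the main model counts upward modulo 10;
# the assistant agrees except right after the token 2.
def target_next(tokens):
    return (tokens[-1] + 1) % 10

def draft_next(tokens):
    return 0 if tokens[-1] == 2 else (tokens[-1] + 1) % 10

first_round = speculative_step(target_next, draft_next, [0])
output = generate(target_next, draft_next, prompt=[0], max_new_tokens=8)
```

In this greedy form the output is identical to what the main model would produce on its own. The speed-up comes from the main model confirming several tokens per pass instead of generating them one at a time.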