Notes and implementations based on learnings from the Udemy LLM Course.
Install Anaconda via the Conda Docs. Afterwards:
- Run `open ~/.zshrc` and add `export PATH=$PATH:$HOME/anaconda3/bin`.
- `cd` to the project dir.
- Run `conda env create -f environment.yml` to create the environment.
- Run `conda init && conda activate && conda activate llms` to kickstart the virtual environment.
- Alternatively, run `python3.11 -m venv llms`. Python 3.11 is the only compatible version at the moment (August 2025).
- Run `source llms/bin/activate` instead of `conda init && conda activate && conda activate llms`.
- Run `python3.11 -m pip install --upgrade pip` to upgrade pip.
- Run `pip install -r requirements.txt` to install the requirements.
- Run `jupyter lab`.
- Run `cat > .env` to create an environments file and paste in your OpenAI key, e.g. `OPENAI_API_KEY=sk-proj-xyz`. Get the key from the OpenAI Org Billing page (invest $5). The file will look like this:
  ```
  OPENAI_API_KEY=sk-proj-xyz
  GOOGLE_API_KEY=xyz
  ANTHROPIC_API_KEY=xyz
  DEEPSEEK_API_KEY=xyz
  HF_TOKEN=xyz
  ```
- Run these:
  ```
  pip install selenium
  pip install playwright
  playwright install  # to install browser binaries after installing playwright
  pip install nest_asyncio
  ```
- Run `xattr -d com.apple.quarantine drivers/chromedriver-mac-x64/chromedriver` on Mac/Apple to authorize chromedriver.
- Visit ollama.com and install! Browse to http://localhost:11434/ to find the message "Ollama is running".
- Otherwise, run `ollama serve`.
- In another Terminal, run `ollama pull llama3.2`.
- Then revisit http://localhost:11434/.
- If Ollama is slow on your machine, try `MODEL = "llama3.2:1b"`.
- To make sure the model is loaded, run `!ollama pull llama3.2`.
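To sanity-check the local Ollama server from Python, here is a minimal sketch (it assumes the `requests` package is installed and that `llama3.2` has already been pulled):

```python
# Minimal check that Ollama is serving locally and the model responds.
import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.2",  # or "llama3.2:1b" on slower machines
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "stream": False,
    },
)
print(response.json()["message"]["content"])
```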
This guide offers a foundational understanding of Large Language Models (LLMs), from their core technical concepts to the tools, techniques, and trends shaping the industry.
A Large Language Model (LLM) is a sophisticated type of artificial intelligence (AI) designed to understand, process, and generate human-like text. The "large" in its name refers to the immense scale of both its architecture and the data it's trained on. The capability of an LLM is often measured by its number of parameters—internal variables the model learns during training and uses to make predictions. Generally, a higher parameter count indicates a more powerful and complex model.
The growth in model size has been exponential:
- GPT-1 (2018): 117 million parameters
- Llama 3.1 (405B version): 405 billion parameters
- GPT-4 (estimate): Over 1 trillion parameters
LLMs don't process text as whole words or sentences. Instead, they break it down into smaller units called tokens.
- Tokens: The fundamental building blocks of language for an AI.
- Rule of Thumb: 1 token is roughly 4 characters or ¾ of a word. This means 1,000 tokens equate to approximately 750 words.
- The Tokenizer: An essential component that maps text to a sequence of tokens (`encode`) and tokens back to text (`decode`). Each model has its own tokenizer and a specific vocabulary (the set of all possible tokens it understands).
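To see tokenization in action, here is a small sketch using OpenAI's `tiktoken` library (one tokenizer among many; each model family ships its own):

```python
# Encode text to tokens and decode back with a GPT-style tokenizer (tiktoken assumed installed).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # encoding used by GPT-3.5/GPT-4-era models
tokens = enc.encode("Large Language Models process text as tokens.")
print(len(tokens), tokens)                   # roughly 1 token per ~4 characters
print(enc.decode(tokens))                    # round-trips back to the original text
```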
The context window is the maximum number of tokens an LLM can process at once. This includes both the input prompt and the generated output. The size of the context window dictates how much information the model can "remember" in a single conversation or task.
The foundation for nearly all modern LLMs is the Transformer architecture, introduced by Google in 2017. Its key innovation was processing entire sequences of data in parallel, which dramatically accelerated training compared to older, sequential models like Recurrent Neural Networks (RNNs).
Key components of the Transformer include:
- Self-Attention Mechanism: This is the core concept. It allows the model to weigh the importance of different tokens within the input text relative to each other, regardless of their position. This is how LLMs capture context and understand complex relationships in language.
- Encoder-Decoder Structure: The original design includes an encoder to create a rich numerical representation of the input and a decoder to generate the text output from that representation.
- Multi-Head Attention: An enhancement where multiple self-attention mechanisms run in parallel, each focusing on different aspects of the token relationships, creating a more nuanced understanding.
- Positional Encoding: Since tokens are processed in parallel, the model needs a way to understand word order. Positional encodings are added to the token representations (embeddings) to provide this crucial positional information.
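To make the self-attention idea concrete, here is a minimal NumPy sketch of scaled dot-product attention (illustrative only; real Transformers use learned projection matrices, multiple heads, and positional encodings):

```python
import numpy as np

def self_attention(X):
    """X: (seq_len, d_model) token embeddings; returns context-mixed representations."""
    d = X.shape[-1]
    Q, K, V = X, X, X                                # real models apply learned W_q, W_k, W_v here
    scores = Q @ K.T / np.sqrt(d)                    # pairwise relevance between all token pairs
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: attention weights per token
    return weights @ V                               # each token becomes a weighted mix of all tokens

X = np.random.randn(5, 8)                            # 5 tokens with embedding size 8
print(self_attention(X).shape)                       # (5, 8)
```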
Developing effective LLM applications relies on three pillars: the models themselves, the tools to build with them, and the techniques to optimize their performance.
The model is the foundational engine of any AI application.
- Open-Source Models: Freely available for modification and self-hosting. Examples include Meta's Llama family, Mistral's Mixtral, and Google's Gemma.
- Closed-Source Models: Proprietary models accessed via APIs. Examples include OpenAI's GPT series, Anthropic's Claude, and Google's Gemini.
- Multimodal Models: Capable of processing and generating information across different data types, such as text, images, audio, and video.
These are the frameworks and infrastructure used to build, train, and deploy LLMs.
- Hugging Face: The central hub for the open-source AI community, offering:
- Model Hub: Over 1.9 million pre-trained models.
- Datasets: Over 200,000 datasets for training and evaluation.
- Spaces: A platform for hosting and sharing AI demos.
- Libraries: Essential tools like `transformers`, `datasets`, `peft` (Parameter-Efficient Fine-Tuning), and `accelerate`; see the pipeline sketch after this list.
- Development Frameworks:
- LangChain & LlamaIndex: Help build complex, data-aware applications by chaining LLM calls with other data sources and APIs.
- Experiment Tracking:
- Weights & Biases: A platform for visualizing and tracking machine learning experiments.
- Development Environments:
- Google Colab: A cloud-based Jupyter Notebook environment that provides free access to GPUs, making it ideal for experimenting and training models.
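As a taste of the `transformers` library, here is a minimal text-generation sketch (the small `gpt2` model is chosen purely for illustration and downloads on first run):

```python
from transformers import pipeline

# Build a text-generation pipeline around a pre-trained model from the Hugging Face Hub.
generator = pipeline("text-generation", model="gpt2")
result = generator("Large Language Models are", max_new_tokens=20)
print(result[0]["generated_text"])
```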
These are methods for eliciting the best possible performance from an LLM.
- Prompting: The art of crafting effective instructions for the model.
- Zero-shot: The model responds to a task without any examples.
- One-shot / Few-shot: The prompt includes one or more examples to guide the model's response (see the sketch after this list).
- Retrieval-Augmented Generation (RAG): Enhances model accuracy by retrieving relevant information from an external knowledge base (like a company's internal documents) and adding it to the prompt as context. This helps reduce hallucinations and keeps information current.
- Fine-Tuning: Adapting a pre-trained model to a specific task or domain by training it further on a smaller, specialized dataset.
- Quantization: A compression technique that reduces the memory footprint of a model, allowing it to run on less powerful hardware.
- Agentization: Building autonomous agents that can use LLMs to reason, plan, and execute multi-step tasks by interacting with external tools (e.g., APIs, databases).
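Here is a minimal few-shot prompting sketch using the OpenAI Python SDK (v1.x assumed; the model name and examples are illustrative, and `OPENAI_API_KEY` must be set in the environment):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Classify the sentiment of each review as Positive or Negative."},
        {"role": "user", "content": "Review: 'Loved it!'"},          # example 1
        {"role": "assistant", "content": "Positive"},
        {"role": "user", "content": "Review: 'Waste of money.'"},    # example 2
        {"role": "assistant", "content": "Negative"},
        {"role": "user", "content": "Review: 'The course was clear and practical.'"},
    ],
)
print(response.choices[0].message.content)
```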
Choosing the right model requires a careful analysis of its capabilities, costs, and performance on relevant benchmarks.
Model Basics:
- Source: Open-source or closed-source?
- Parameters: How large is the model? This impacts capability and fine-tuning costs.
- Training Data: What was the size (in tokens) and knowledge cut-off date of its training corpus?
- Context Length: What is the maximum size of the context window?
Operational Factors:
- Costs (Inference & Training): Consider API costs for proprietary models versus compute costs for self-hosted open-source models. Fine-tuning introduces additional training costs.
- Speed & Latency: How many tokens per second can it generate (throughput), and how long does it take to get the first token (latency)?
- Rate Limits & Reliability: API-based models often have usage limits and can experience downtime.
- Licensing: Always check the license for commercial use restrictions, especially for open-source models.
A key finding from DeepMind suggests that for optimal performance, the model size (number of parameters) and the training data size (number of tokens) should be scaled proportionally. A model can be undertrained if it's too large for its dataset.
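As a rough illustration of that scaling guidance (the ~20-tokens-per-parameter ratio below is an assumption drawn from the Chinchilla paper, not a figure from these notes):

```python
# Compute-optimal training data size under the assumed ~20 tokens-per-parameter heuristic.
params = 70e9                     # a 70B-parameter model
optimal_tokens = 20 * params      # ~1.4e12, i.e. about 1.4 trillion training tokens
print(f"{optimal_tokens:.1e} training tokens")
```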
Benchmarks are standardized tests used to rank and compare LLMs across various capabilities.
- MMLU: Massive Multitask Language Understanding.
- GSM8K: Grade-school math word problems.
- HumanEval: Python code generation.
- ARC: Scientific reasoning challenges.
- HellaSwag: Common-sense reasoning.
- TruthfulQA: Measures a model's tendency to generate factual, non-hallucinated answers.
- ELO Rating: A system (often used in leaderboards like the Chatbot Arena) where human raters compare model outputs side-by-side.
- Inconsistent Application: Different providers may use slightly different methodologies.
- Data Contamination: Benchmark questions may have leaked into the models' training data, inflating scores.
- Overfitting: Models can be "tuned" to perform well on specific benchmarks without generalizing the underlying skill.
- Narrow Scope: Multiple-choice questions often fail to capture nuanced reasoning abilities.
New, harder tests are emerging to push the limits of frontier models:
- GPQA: Graduate-level questions designed to be "Google-proof."
- MMLU-PRO: A more challenging version of the original MMLU.
- IFEval: Tests a model's ability to follow complex instructions.
These platforms aggregate benchmark results to help with model selection:
- Hugging Face: Open LLM Leaderboard and others for specific tasks.
- LMSYS: Chatbot Arena Leaderboard based on human preferences.
- Vellum: LLM Leaderboard with comparisons of cost and context windows.
- Scale: SEAL Leaderboards for expert-level evaluation.
LLMs are being integrated into specialized professional tools across industries:
- Law: Harvey.ai provides AI assistance for legal work.
- Hiring: Nebula.io assists in talent acquisition.
- Software Engineering: Bloop.ai helps understand and migrate legacy codebases.
- CRM: Salesforce's Einstein Copilot integrates generative AI into customer relationship management.
The performance of a GenAI solution can be evaluated from two perspectives:
- Model-Centric (Technical) Metrics: Easier to optimize directly.
- Loss / Perplexity: Measures how "surprised" a model is by the correct answer.
- Accuracy, Precision, Recall, F1: Standard classification metrics.
- Business-Centric (Outcome) Metrics: Measure the tangible impact.
- ROI: Return on investment.
- KPIs: Improvements in time, cost, customer satisfaction, or other key business indicators.
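For example, perplexity is simply the exponential of the average cross-entropy loss, so the two model-centric metrics move together (a small illustrative sketch):

```python
import math

avg_cross_entropy = 2.1                    # illustrative average loss in nats per token
perplexity = math.exp(avg_cross_entropy)   # ~8.2: as "surprised" as choosing among ~8 tokens
print(f"Perplexity: {perplexity:.1f}")     # lower is better
```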
Even the most advanced LLMs have fundamental weaknesses:
- Knowledge Cut-off: Their knowledge is frozen at the end of their training date, so they are unaware of recent events.
- Hallucinations: They can generate plausible but incorrect or nonsensical information with high confidence.
- Niche Domains: They may lack the deep, specialized expertise required for highly technical or niche fields.
- LangChain is a framework created in late 2022 to simplify building LLM applications by chaining functionalities.
- It standardizes and simplifies retrieval-augmented generation (RAG) workflows, enabling quick time to market.
- LangChain provides wrappers around common LLM APIs, allowing easy switching between models like OpenAI and Claude.
- Although the need for it has decreased as the underlying APIs mature, LangChain remains a valuable tool for loading, chunking, and vectorizing knowledge bases efficiently.
- LangChain significantly simplifies creating applications for common tasks like assistance and retrieval-augmented generation (RAG).
- Document/Text splitting with LangChain (https://python.langchain.com/docs/concepts/text_splitters/).
- How partitioning a PDF works (https://docs.unstructured.io/api-reference/partition/partitioning).
- How Chunking works (https://docs.unstructured.io/api-reference/partition/chunking).
- Unstructured is an open-source tool for converting documents to structured data effortlessly (https://github.com/Unstructured-IO/unstructured).
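Here is a minimal chunking sketch with LangChain's `RecursiveCharacterTextSplitter` (the import path assumes recent package layouts; chunk sizes are illustrative, and `documents` is assumed to have been loaded earlier, e.g. with a directory loader):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split loaded documents into overlapping chunks suitable for embedding and retrieval.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)   # 'documents' loaded earlier in the pipeline
print(f"Created {len(chunks)} chunks")
```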
Chroma is a lovely datastore for keeping vectors, and we can visualize the data thereafter. For example:
```python
# Import paths vary slightly by langchain version.
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Use the OpenAI Embeddings model for text representation.
embeddings = OpenAIEmbeddings()

# Create the Chroma vector store from the prepared chunks.
vectorstore = Chroma.from_documents(
    documents=chunks,           # populate the vector store with the document chunks
    embedding=embeddings,       # the embeddings model used to vectorize each chunk
    persist_directory=db_name   # persist the vector store to a local directory for later use
)
```
LangChain defines several abstractions that simplify building applications:
- LLM: Represents a language model, such as OpenAI's GPT. This abstraction encapsulates the model interface.
- Retriever: An interface to a vector store, such as Chroma, used for retrieval in RAG workflows. It enriches prompts by retrieving relevant documents.
- Memory: Represents the history of a conversation with a chatbot. It abstracts the underlying data structure, typically a list of messages, managing context across interactions.
For example:
```python
# Import paths vary slightly by langchain version.
from langchain_openai import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

# Create a new chat with OpenAI; temperature=0.7 influences the creativity of the responses.
llm = ChatOpenAI(temperature=0.7)
# Conversation memory stores the chat history, allowing the AI to maintain context across turns.
memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)
# Create a retriever from the Chroma datastore to fetch relevant information for responses.
retriever = vectorstore.as_retriever()
# Put it together: LLM, retriever, and memory form a ConversationalRetrievalChain,
# combining language generation with retrieved context in an interactive conversation.
conversation_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory)
```
For migrating memory between LangChain versions, see https://python.langchain.com/docs/versions/migrating_memory/
- LCEL (LangChain Expression Language) is declarative and can be used as an alternative to the code approach, allowing setup of chains using YAML files.
- LCEL closely maps to the Python code for defining models, memory, embeddings, vector stores, retrievers, and chains.
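For reference, LCEL chains are also commonly composed directly in Python with the pipe operator; a minimal sketch (assumes recent `langchain-core` and `langchain-openai` packages and an `OPENAI_API_KEY` in the environment):

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

# Compose prompt -> model -> parser into a single runnable chain.
prompt = ChatPromptTemplate.from_template("Summarize in one sentence: {text}")
chain = prompt | ChatOpenAI(temperature=0.7) | StrOutputParser()
print(chain.invoke({"text": "LangChain Expression Language composes runnables declaratively."}))
```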
- Major challenge for you - your own private Knowledge Worker
- Create a Knowledge Worker on your information to boost productivity
- Assemble all your files in 1 place; your personal Knowledge Base
- Vectorize everything in Chroma - your vector datastore
- Build a Conversational AI and ask questions!
- Advanced ideas to take it to the next level
- If you use Google Workspace, use Google's API to read your own docs
- If you use MS Office, use libraries to read Office docs
- Harder - use libraries to connect to your email inbox, and Slack, and more!