A mathematical function with learned parameters that maps inputs to outputs. Think of it as a compiled artifact—the result of training that you deploy and run inference against. Models range from simple linear functions to neural networks with billions of parameters.
Core Concepts
The process of optimizing a model's parameters by repeatedly showing it examples and adjusting weights to minimize prediction errors. Analogous to a build process, but instead of compiling code, you're fitting parameters to data. Can take minutes to months depending on model size and data volume.
Running a trained model to get predictions on new data. This is the "runtime" of ML—where your model actually does useful work. Inference is typically much faster than training and is what you optimize for in production systems.
The learnable values in a model that get adjusted during training. When you hear "a 70B parameter model," it means 70 billion individual numbers that were tuned during training. These are what you're loading when you load a model checkpoint.
Configuration values set before training that control how the model learns—learning rate, batch size, number of layers, etc. Unlike parameters, these aren't learned automatically. Tuning them is often more art than science and can dramatically affect results.
The input variables your model uses to make predictions. In a user churn model, features might be days_since_last_login, subscription_tier, support_tickets_count. Feature engineering—crafting good inputs—is often more impactful than model architecture.
The correct answers in your training data that the model learns to predict. If you're training a spam classifier, labels are the "spam" or "not spam" tags humans assigned to each email. Acquiring quality labels is often the hardest part of ML projects.
A single input-output pair in your dataset. One email with its spam/not-spam label is one example. Training typically requires thousands to millions of examples, depending on problem complexity.
A collection of examples used for training or evaluation. Usually split into training set (what the model learns from), validation set (for tuning during development), and test set (for final evaluation). Data quality matters more than model sophistication.
Training Process
A function that measures how wrong the model's predictions are. Training minimizes this value. Cross-entropy loss for classification, mean squared error for regression. The choice of loss function encodes what "good" means for your problem.
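A minimal NumPy sketch of the two losses named above, computed on made-up toy values:

```python
import numpy as np

# Mean squared error for regression: average squared difference.
y_true = np.array([3.0, -0.5, 2.0])
y_pred = np.array([2.5, 0.0, 2.1])
mse = np.mean((y_true - y_pred) ** 2)

# Cross-entropy for classification: penalize low probability on the true class.
probs = np.array([0.7, 0.2, 0.1])   # model's predicted class probabilities
true_class = 0
cross_entropy = -np.log(probs[true_class])

print(f"MSE: {mse:.3f}, cross-entropy: {cross_entropy:.3f}")
```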
The optimization algorithm that iteratively adjusts parameters to minimize loss. Computes the gradient (slope) of the loss with respect to each parameter and moves in the downhill direction. It's how neural networks actually learn—following the gradient toward better predictions.
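A minimal sketch of gradient descent fitting a single-parameter linear model in NumPy; the data, learning rate, and step count are arbitrary illustration values:

```python
import numpy as np

# Toy data: y is roughly 3 * x plus a little noise.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + 0.1 * rng.normal(size=100)

w = 0.0    # the single learnable parameter
lr = 0.1   # learning rate: step size along the negative gradient

for step in range(100):
    y_pred = w * x
    loss = np.mean((y_pred - y) ** 2)     # mean squared error (for monitoring)
    grad = np.mean(2 * (y_pred - y) * x)  # d(loss)/d(w)
    w -= lr * grad                        # step downhill

print(f"learned w = {w:.2f} (true value 3.0)")
```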
The algorithm for efficiently computing gradients in neural networks by propagating errors backward from output to input layers. Uses the chain rule from calculus. This is what frameworks like PyTorch and TensorFlow automate via automatic differentiation.
Controls how big of a step to take during gradient descent. Too high: training is unstable and may diverge. Too low: training is slow and may get stuck. Often the most important hyperparameter to tune. Modern training often uses learning rate schedules that change over time.
A subset of training examples processed together in one forward/backward pass. Batch size affects memory usage, training stability, and speed. Larger batches = more parallelism but more memory. Typical sizes: 16-512 for most tasks, though LLM training can use thousands.
One complete pass through the entire training dataset. Training for 10 epochs means showing the model every example 10 times. Models typically need multiple epochs to converge, but too many can cause overfitting.
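A minimal PyTorch sketch showing batches and epochs together; the dataset is random noise and all sizes are arbitrary:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 1000 examples, 10 features, random targets (illustration only).
X = torch.randn(1000, 10)
y = torch.randn(1000, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(10):            # 10 full passes over the dataset
    for xb, yb in loader:          # each xb/yb is one batch of 32 examples
        loss = loss_fn(model(xb), yb)
        opt.zero_grad()
        loss.backward()            # backpropagation computes gradients
        opt.step()                 # gradient descent updates parameters
```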
When training loss stops decreasing meaningfully—the model has learned what it can from the data. Training is typically stopped at or near convergence. Early stopping is a technique to halt training before overfitting begins.
A saved snapshot of model parameters during training. Like a database backup—lets you resume training or roll back to an earlier state. Also how trained models are distributed: download a checkpoint file to use a pre-trained model.
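A short PyTorch sketch of saving and restoring a checkpoint with `state_dict`; the model and optimizer here are placeholders:

```python
import torch
from torch import nn

model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

# Save a snapshot of the parameters (and optimizer state, so training can resume).
torch.save({"model_state": model.state_dict(),
            "optimizer_state": opt.state_dict()}, "checkpoint.pt")

# Later: load the snapshot back into freshly constructed objects.
ckpt = torch.load("checkpoint.pt")
model.load_state_dict(ckpt["model_state"])
opt.load_state_dict(ckpt["optimizer_state"])
```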
Model Quality & Evaluation
When a model performs well on training data but poorly on new data—it memorized rather than learned generalizable patterns. Like code that passes all existing tests but fails on new inputs. Combat with more data, regularization, or simpler models.
When a model fails to capture patterns in the training data—it's too simple for the problem. Performance is poor on both training and new data. Fix by using a more complex model, adding features, or training longer.
Techniques to prevent overfitting by constraining the model. L2 regularization penalizes large weights. Dropout randomly disables neurons during training. Think of it as adding noise to prevent the model from relying too heavily on any single pattern.
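A short PyTorch sketch of both techniques: dropout as a layer and L2 regularization via the optimizer's `weight_decay`; sizes and rates are arbitrary:

```python
import torch
from torch import nn

# Dropout randomly zeroes 20% of activations during training.
model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(128, 10),
)

# weight_decay adds an L2 penalty on the weights to each optimizer update.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

model.train()   # dropout is active during training
model.eval()    # dropout is disabled for inference
```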
Scaling inputs or intermediate values to a standard range (often mean=0, variance=1). Batch normalization normalizes across a batch; layer normalization across features. Helps training stability and speed. Also refers to preprocessing inputs to similar scales.
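A minimal NumPy sketch of standardizing features to mean 0 and variance 1 (the values are made up):

```python
import numpy as np

X = np.array([[180.0, 70.0],    # toy features on very different scales
              [165.0, 55.0],    # (e.g. height in cm, weight in kg)
              [172.0, 80.0]])

# Standardize each column: subtract its mean, divide by its standard deviation.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
```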
Data held out from training to tune hyperparameters and check for overfitting during development. Not the same as test set—you'll look at validation metrics repeatedly, so there's some implicit data leakage. Test set is for final evaluation only.
Percentage of correct predictions. Simple but often misleading—a spam filter that never flags spam has 95% accuracy if 95% of email is legitimate. For imbalanced datasets, precision, recall, and F1 are usually more informative.
Precision: of things you flagged as positive, what fraction actually were? Recall: of actual positives, what fraction did you catch? There's usually a tradeoff—more aggressive flagging improves recall but hurts precision. F1 score is their harmonic mean.
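Continuing the spam-filter example, a small sketch computing these metrics from made-up confusion counts:

```python
# Toy confusion counts from a spam classifier (values invented for illustration).
tp, fp, fn = 40, 10, 20   # true positives, false positives, false negatives

precision = tp / (tp + fp)   # of flagged emails, how many were actually spam?
recall    = tp / (tp + fn)   # of actual spam, how much did we catch?
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```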
A standardized dataset and evaluation protocol for comparing models. MNIST for image classification, SQuAD for reading comprehension, MMLU for language model knowledge. Lets you compare your model against published baselines and track progress.
Learning Paradigms
Training with labeled examples—each input has a known correct output. Classification and regression are supervised. The "supervised" part is the labels telling the model what's right. Most production ML systems use supervised learning.
Learning patterns from unlabeled data. Clustering, dimensionality reduction, and anomaly detection are unsupervised. The model finds structure without being told what to look for. Often used for data exploration or as a preprocessing step.
Creating labels automatically from the data itself. LLMs use this: mask a word and predict it from context. No human labeling needed, enabling training on massive unlabeled datasets. The breakthrough that enabled modern foundation models.
Learning through trial and error with rewards/penalties. An agent takes actions in an environment and learns from feedback. Used for games, robotics, and RLHF in LLMs. Different from supervised learning—there's no "correct answer," just better or worse outcomes.
Using knowledge from one task to improve performance on another. A model trained on ImageNet can be adapted for medical imaging with less data. The foundation of modern ML—don't train from scratch, start from a pre-trained model.
Continuing training on a pre-trained model with new, task-specific data. Takes a general model and specializes it. Much faster and cheaper than training from scratch. The standard approach: take a foundation model, fine-tune on your data.
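A hedged PyTorch sketch of the usual recipe: freeze a pre-trained backbone and train a new task-specific head. `PretrainedBackbone` is a hypothetical stand-in, not a real library class:

```python
import torch
from torch import nn

class PretrainedBackbone(nn.Module):     # stand-in for any pre-trained model
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(512, 256), nn.ReLU())
    def forward(self, x):
        return self.encoder(x)

backbone = PretrainedBackbone()
# backbone.load_state_dict(torch.load("pretrained.pt"))  # in practice: load real weights

for p in backbone.parameters():          # freeze: keep the general knowledge intact
    p.requires_grad = False

head = nn.Linear(256, 2)                 # new task-specific classifier head
model = nn.Sequential(backbone, head)

# Only the head's parameters are updated on your (much smaller) task dataset.
opt = torch.optim.AdamW(head.parameters(), lr=1e-4)
```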
Initial training on a large, general dataset before task-specific fine-tuning. LLMs are pre-trained on internet text to learn language patterns, then fine-tuned for specific tasks. Pre-training is expensive (millions of dollars for large models) but only done once.
Neural Network Architecture
A model composed of layers of interconnected nodes (neurons) that transform inputs through learned weights and non-linear activation functions. "Deep learning" just means neural networks with many layers. Universal function approximators—with enough capacity they can, in theory, approximate any continuous function.
A group of neurons that process inputs and pass outputs to the next layer. Input layer receives data, hidden layers perform transformations, output layer produces predictions. More layers = more capacity to learn complex patterns, but also more parameters and compute.
Non-linear function applied after each layer's linear transformation. Without activations, stacking layers would just be one big linear function. ReLU (max(0, x)) is common; GELU is popular in transformers. Introduces the non-linearity needed to learn complex patterns.
The property that makes neural networks powerful. Linear functions can only learn straight-line relationships—stacking them just gives another linear function. Non-linear activations between layers let networks learn curves, boundaries, and complex patterns. Without non-linearity, a 100-layer network would be no more expressive than a single layer.
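A small NumPy demonstration of this point: two stacked linear layers collapse into a single linear map, but inserting a ReLU between them does not:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))    # first "layer"
W2 = rng.normal(size=(2, 4))    # second "layer"
x = rng.normal(size=3)

# Without an activation, stacking is equivalent to the single matrix W2 @ W1.
stacked = W2 @ (W1 @ x)
collapsed = (W2 @ W1) @ x
print(np.allclose(stacked, collapsed))      # True

# With a non-linearity in between, no single linear map reproduces the result.
relu = lambda z: np.maximum(0, z)
nonlinear = W2 @ relu(W1 @ x)
print(np.allclose(nonlinear, collapsed))    # False (in general)
```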
Architecture specialized for grid data like images. Uses convolution operations that slide learned filters across the input, detecting local patterns regardless of position. The standard architecture for computer vision. Key insight: local patterns (edges, textures) matter more than global pixel positions.
Architecture for sequential data that maintains hidden state across time steps. LSTM and GRU are variants that handle long-range dependencies better. Largely superseded by Transformers for most sequence tasks, but still used in some real-time applications.
The dominant architecture for modern AI. Processes sequences in parallel using attention mechanisms rather than recurrence. Enables massive scaling and parallelization. GPT, BERT, and virtually all modern LLMs are transformers. The "T" in GPT and ChatGPT.
Mechanism that lets the model dynamically focus on relevant parts of the input when producing each output. "Self-attention" relates different positions in the same sequence. Enables direct connections between any two tokens regardless of distance—solving the long-range dependency problem.
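A minimal NumPy sketch of scaled dot-product attention; it omits the learned query/key/value projections and multi-head machinery, and the inputs are random toy values:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query to every key
    weights = softmax(scores, axis=-1)   # how much each position attends to the others
    return weights @ V                   # weighted sum of the values

# Toy self-attention over a sequence of 5 tokens with dimension 8 (Q = K = V).
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
out = scaled_dot_product_attention(x, x, x)
```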
A learned dense vector representation of discrete items (words, users, products). Maps high-dimensional sparse data to low-dimensional continuous space where similar items are nearby. Word embeddings capture semantic relationships: king - man + woman ≈ queen.
Architecture pattern where an encoder processes input into a representation, and a decoder generates output from it. Used in translation, summarization. BERT is encoder-only (understanding), GPT is decoder-only (generation), T5 is encoder-decoder (both).
LLMs & Generative AI
Neural networks trained on massive text corpora to predict next tokens. "Large" typically means billions of parameters. Capabilities emerge from scale: reasoning, instruction-following, in-context learning. GPT-4, Claude, Llama are examples.
The atomic unit of text that models process. Not quite words—common words are single tokens, rare words split into subwords. "Tokenization" breaks text into tokens. Affects cost (APIs charge per token) and context limits. Roughly 1 token ≈ 4 characters or ¾ of a word in English.
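A small example, assuming the `tiktoken` package (OpenAI's open-source tokenizer library) is installed; exact splits vary by tokenizer:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Tokenization splits text into subword units.")
print(len(tokens))          # number of tokens, i.e. what an API would bill for
print(enc.decode(tokens))   # round-trips back to the original text
```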
Maximum number of tokens a model can process at once—its "working memory." Includes both input (your prompt) and output (the response). Ranges from 4K to 200K+ tokens depending on the model. Larger contexts enable longer documents but increase compute cost quadratically.
The input text you give to an LLM to elicit a response. Prompt engineering is the practice of crafting inputs that produce desired outputs. Unlike traditional programming, you're "programming" with natural language—the prompt is your interface to the model.
Special instructions that set the model's behavior, persona, or constraints for a conversation. Processed before user messages. Where you define "You are a helpful coding assistant" or "Always respond in JSON." Not all models support system prompts distinctly from user prompts.
Controls randomness in text generation. Temperature=0 is deterministic (always picks highest probability token). Higher values (0.7-1.0) increase creativity and variety but also errors. Lower values are better for factual tasks; higher for creative writing.
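A minimal NumPy sketch of how temperature reshapes the next-token distribution (temperature 0 is handled as a plain argmax in practice, since dividing by zero is undefined); the logits are made up:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    scaled = np.asarray(logits) / temperature
    scaled = scaled - scaled.max()      # numerical stability
    e = np.exp(scaled)
    return e / e.sum()

logits = [2.0, 1.0, 0.5]                          # toy next-token scores
print(softmax_with_temperature(logits, 0.1))      # nearly all mass on the top token
print(softmax_with_temperature(logits, 1.0))      # moderate spread
print(softmax_with_temperature(logits, 2.0))      # flatter: more randomness when sampling
```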
Methods to control which tokens the model considers when generating. Top-k: only consider the k most likely tokens. Top-p (nucleus): consider tokens until cumulative probability reaches p. Both prevent selecting very unlikely tokens while maintaining diversity.
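Minimal NumPy sketches of both filters over a made-up next-token distribution:

```python
import numpy as np

def top_k_filter(probs, k):
    """Zero out everything outside the k most likely tokens, then renormalize."""
    probs = np.asarray(probs, dtype=float).copy()
    cutoff = np.sort(probs)[-k]
    probs[probs < cutoff] = 0.0
    return probs / probs.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    probs = np.asarray(probs, dtype=float).copy()
    order = np.argsort(probs)[::-1]                 # tokens from most to least likely
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, p) + 1]
    mask = np.zeros_like(probs)
    mask[keep] = probs[keep]
    return mask / mask.sum()

probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
print(top_k_filter(probs, 2))     # only the two most likely tokens remain
print(top_p_filter(probs, 0.9))   # tokens kept until cumulative probability reaches 0.9
```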
Zero-shot: asking the model to perform a task with no examples. Few-shot: providing a few examples in the prompt before asking. LLMs can generalize from minimal examples—a key emergent capability. More examples usually improve performance but consume context.
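A tiny illustration of the difference, with a made-up sentiment task:

```python
zero_shot = (
    "Classify the sentiment of this review as positive or negative.\n"
    "Review: The battery died after two days.\nSentiment:"
)

few_shot = (
    "Review: Works perfectly, highly recommend.\nSentiment: positive\n"
    "Review: Broke on the first use.\nSentiment: negative\n"
    "Review: The battery died after two days.\nSentiment:"
)
```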
Training technique that uses human preferences to guide model behavior. Humans rank model outputs, a reward model learns these preferences, then RL optimizes the LLM against this reward. How models become helpful, harmless, and honest after pre-training.
When a model generates confident but false or fabricated information. LLMs predict plausible text, not truth—they'll invent citations, facts, or code that looks right but isn't. A fundamental limitation requiring validation, retrieval augmentation, or other mitigation.
Connecting model outputs to verifiable sources or real-world data. Reduces hallucination by anchoring responses to retrieved documents, databases, or APIs. RAG is a grounding technique. Grounded responses can cite sources; ungrounded ones cannot.
RAG & Retrieval
Architecture that retrieves relevant documents and includes them in the LLM's context before generating. Combines the knowledge retrieval of search with the synthesis ability of LLMs. The standard pattern for building LLM apps over your own data—cheaper and more accurate than fine-tuning for many use cases.
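A high-level sketch of the pattern; `embed`, `vector_store`, and `llm` are hypothetical stand-ins for whichever embedding model, vector database, and LLM client you use:

```python
def answer_with_rag(question, vector_store, embed, llm, k=5):
    # 1. Retrieve: find the stored chunks most similar to the question.
    query_vector = embed(question)
    chunks = vector_store.search(query_vector, top_k=k)

    # 2. Augment: put the retrieved text into the prompt as context.
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

    # 3. Generate: the LLM synthesizes an answer grounded in the retrieved text.
    return llm(prompt)
```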
Database optimized for storing and searching embedding vectors by similarity. Core infrastructure for RAG—stores document embeddings and efficiently finds the most relevant chunks for a query. Pinecone, Weaviate, Chroma, pgvector are examples.
Finding documents by meaning rather than keyword matching. Converts query and documents to embeddings, then finds nearest neighbors. "Running shoes" finds "athletic footwear" even without word overlap. The retrieval mechanism in RAG systems.
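A minimal NumPy sketch of nearest-neighbor search by cosine similarity; the vectors are random stand-ins for real embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy embeddings (in practice these come from an embedding model, not random numbers).
rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(1000, 384))   # 1000 documents, 384-dim embeddings
query_vector = rng.normal(size=384)

scores = np.array([cosine_similarity(query_vector, d) for d in doc_vectors])
top_5 = np.argsort(scores)[::-1][:5]         # indices of the 5 most similar documents
```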
Splitting documents into smaller pieces for embedding and retrieval. Chunk size affects retrieval quality—too small loses context, too large dilutes relevance. Common strategies: fixed size, sentence boundaries, recursive splitting, semantic chunking. A key RAG tuning parameter.
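A minimal sketch of the simplest strategy, fixed-size chunking with overlap; the sizes are arbitrary:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Fixed-size chunks with overlap so context isn't cut off at hard boundaries."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

chunks = chunk_text("some long document " * 200)
```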
A second-stage retrieval step that uses a more expensive model to reorder initial search results. First stage retrieves many candidates quickly (vector search), reranker scores each for relevance more accurately. Significantly improves RAG quality at modest compute cost.
Tools & Infrastructure
Libraries for building and training models. PyTorch dominates research and, increasingly, production. TensorFlow is common in enterprise. JAX is used at Google/DeepMind. They provide automatic differentiation, GPU acceleration, and neural network building blocks.
Graphics Processing Units repurposed for ML. Neural networks are mostly matrix operations—GPUs do these massively in parallel. NVIDIA dominates with CUDA. Training large models requires expensive GPU clusters; inference can often run on smaller hardware. The hardware constraint that shapes modern AI.
Reducing model precision from 32/16-bit floats to 8-bit or 4-bit integers. Dramatically reduces memory and speeds up inference with modest quality loss. How you run a 70B model on consumer hardware. GPTQ and AWQ are common quantization methods; GGUF (the successor to GGML) is a common file format for quantized models.
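A minimal NumPy sketch of symmetric int8 quantization, a simplified version of what the formats above do per layer or per group of weights:

```python
import numpy as np

weights = np.random.randn(4, 4).astype(np.float32)     # a toy float32 weight matrix

# Symmetric int8 quantization: map the float range onto [-127, 127].
scale = np.abs(weights).max() / 127.0
quantized = np.round(weights / scale).astype(np.int8)   # 4x smaller than float32

# Dequantize to approximate the original values at inference time.
restored = quantized.astype(np.float32) * scale
error = np.abs(weights - restored).max()                # small, but not zero
```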
Open Neural Network Exchange—a standard format for representing models across frameworks. Train in PyTorch, export to ONNX, deploy with ONNX Runtime or convert to other formats. Like a portable binary format for models.
Infrastructure for running inference at scale. Handles batching, load balancing, GPU scheduling, model versioning. vLLM, TGI, Triton, TensorRT-LLM are popular. The deployment layer between your trained model and production traffic.
Latency: time for one request. Throughput: requests per second. Often traded off—batching improves throughput but increases latency. For LLMs, also track time-to-first-token (responsiveness) and tokens-per-second (generation speed).
DevOps for ML—practices for deploying, monitoring, and maintaining models in production. Includes versioning data and models, experiment tracking, CI/CD for ML, monitoring for drift and degradation. Where software engineering meets ML engineering.
Practical Concerns
When production data distribution shifts from training data, degrading model performance. User behavior changes, new products launch, world events happen. Models don't automatically adapt—you need monitoring to detect drift and retraining to address it.
Systematic errors reflecting prejudices in training data or problem framing. A hiring model trained on historical decisions may perpetuate past discrimination. Requires careful dataset curation, evaluation across subgroups, and often explicit fairness constraints.
Understanding why a model made a specific prediction. Important for debugging, compliance, and trust. Simpler models (linear, decision trees) are inherently interpretable. For neural networks, techniques like SHAP, attention visualization, or probing try to extract explanations post-hoc.
Measuring model quality systematically. Beyond test set metrics: human evaluation, A/B testing, red teaming for safety. For LLMs, evaluation is especially challenging—automatic metrics often don't correlate with usefulness. "Evals" are a major focus of LLM development.
Attack where malicious input manipulates an LLM to ignore instructions or behave unexpectedly. Like SQL injection for AI. User input that says "ignore previous instructions and..." can override system prompts. A key security concern for LLM applications with untrusted input.
LLM-powered system that can take actions, use tools, and work autonomously toward goals. Goes beyond Q&A to executing multi-step tasks: browsing the web, writing and running code, calling APIs. ReAct, function calling, and tool use are common patterns.
LLM capability to output structured requests for external tools/APIs. Model decides when to call functions and with what arguments; your code executes them and returns results. Bridges LLMs to real-world actions: database queries, API calls, code execution.
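A hedged sketch of the general flow; the schema loosely follows the JSON-schema style several LLM APIs use, and `get_weather` and the model's reply are hypothetical:

```python
import json

# 1. Describe the tool to the model (sent along with the prompt).
tool_schema = {
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# 2. The model replies with a structured call instead of prose (hypothetical output).
model_output = '{"name": "get_weather", "arguments": {"city": "Berlin"}}'

# 3. Your code parses it, runs the real function, and returns the result to the model.
def get_weather(city):
    return {"city": city, "temperature_c": 21}   # stub implementation

call = json.loads(model_output)
if call["name"] == "get_weather":
    result = get_weather(**call["arguments"])
```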
Prompting technique where the model shows its reasoning steps before answering. Significantly improves performance on math, logic, and multi-step problems. "Let's think step by step" is the classic trigger. The model's intermediate reasoning helps it reach correct answers.