A mathematical function with learned parameters that maps inputs to outputs. Think of it as a compiled artifact—the result of training that you deploy and run inference against. Models range from simple linear functions to neural networks with billions of parameters.
Core Concepts
The process of optimizing a model's parameters by repeatedly showing it examples and adjusting weights to minimize prediction errors. Analogous to a build process, but instead of compiling code, you're fitting parameters to data. Can take minutes to months depending on model size and data volume.
Running a trained model to get predictions on new data. This is the "runtime" of ML—where your model actually does useful work. Inference is typically much faster than training and is what you optimize for in production systems.
The learnable values in a model that get adjusted during training. When you hear "a 70B parameter model," it means 70 billion individual numbers that were tuned during training. These are what you're loading when you load a model checkpoint.
Configuration values set before training that control how the model learns—learning rate, batch size, number of layers, etc. Unlike parameters, these aren't learned automatically. Tuning them is often more art than science and can dramatically affect results.
The input variables your model uses to make predictions. In a user churn model, features might be days_since_last_login, subscription_tier, support_tickets_count. Feature engineering—crafting good inputs—is often more impactful than model architecture.
The correct answers in your training data that the model learns to predict. If you're training a spam classifier, labels are the "spam" or "not spam" tags humans assigned to each email. Acquiring quality labels is often the hardest part of ML projects.
A single input-output pair in your dataset. One email with its spam/not-spam label is one example. Training typically requires thousands to millions of examples, depending on problem complexity.
A collection of examples used for training or evaluation. Usually split into training set (what the model learns from), validation set (for tuning during development), and test set (for final evaluation). Data quality matters more than model sophistication.
Training Process
A function that measures how wrong the model's predictions are. Training minimizes this value. Cross-entropy loss for classification, mean squared error for regression. The choice of loss function encodes what "good" means for your problem.
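A minimal NumPy sketch of the two losses named above, computed on made-up toy values:

```python
import numpy as np

# Mean squared error for regression: average squared difference.
y_true = np.array([3.0, -0.5, 2.0])
y_pred = np.array([2.5, 0.0, 2.1])
mse = np.mean((y_true - y_pred) ** 2)

# Cross-entropy for classification: penalize low probability on the true class.
probs = np.array([0.7, 0.2, 0.1])   # model's predicted class probabilities
true_class = 0
cross_entropy = -np.log(probs[true_class])

print(f"MSE: {mse:.3f}, cross-entropy: {cross_entropy:.3f}")
```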
The optimization algorithm that iteratively adjusts parameters to minimize loss. Computes the gradient (slope) of the loss with respect to each parameter and moves in the downhill direction. It's how neural networks actually learn—following the gradient toward better predictions.
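A minimal sketch of gradient descent fitting a single-parameter linear model in NumPy; the data, learning rate, and step count are arbitrary illustration values:

```python
import numpy as np

# Toy data: y is roughly 3 * x plus a little noise.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + 0.1 * rng.normal(size=100)

w = 0.0    # the single learnable parameter
lr = 0.1   # learning rate: step size along the negative gradient

for step in range(100):
    y_pred = w * x
    loss = np.mean((y_pred - y) ** 2)     # mean squared error (for monitoring)
    grad = np.mean(2 * (y_pred - y) * x)  # d(loss)/d(w)
    w -= lr * grad                        # step downhill

print(f"learned w = {w:.2f} (true value 3.0)")
```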
The algorithm for efficiently computing gradients in neural networks by propagating errors backward from output to input layers. Uses the chain rule from calculus. This is what frameworks like PyTorch and TensorFlow automate via automatic differentiation.
Controls how big of a step to take during gradient descent. Too high: training is unstable and may diverge. Too low: training is slow and may get stuck. Often the most important hyperparameter to tune. Modern training often uses learning rate schedules that change over time.
A subset of training examples processed together in one forward/backward pass. Batch size affects memory usage, training stability, and speed. Larger batches = more parallelism but more memory. Typical sizes: 16-512 for most tasks, though LLM training can use thousands.
One complete pass through the entire training dataset. Training for 10 epochs means showing the model every example 10 times. Models typically need multiple epochs to converge, but too many can cause overfitting.
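A minimal PyTorch sketch showing batches and epochs together; the dataset is random noise and all sizes are arbitrary:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 1000 examples, 10 features, random targets (illustration only).
X = torch.randn(1000, 10)
y = torch.randn(1000, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(10):            # 10 full passes over the dataset
    for xb, yb in loader:          # each xb/yb is one batch of 32 examples
        loss = loss_fn(model(xb), yb)
        opt.zero_grad()
        loss.backward()            # backpropagation computes gradients
        opt.step()                 # gradient descent updates parameters
```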
When training loss stops decreasing meaningfully—the model has learned what it can from the data. Training is typically stopped at or near convergence. Early stopping is a technique to halt training before overfitting begins.
A saved snapshot of model parameters during training. Like a database backup—lets you resume training or roll back to an earlier state. Also how trained models are distributed: download a checkpoint file to use a pre-trained model.
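A short PyTorch sketch of saving and restoring a checkpoint with `state_dict`; the model and optimizer here are placeholders:

```python
import torch
from torch import nn

model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

# Save a snapshot of the parameters (and optimizer state, so training can resume).
torch.save({"model_state": model.state_dict(),
            "optimizer_state": opt.state_dict()}, "checkpoint.pt")

# Later: load the snapshot back into freshly constructed objects.
ckpt = torch.load("checkpoint.pt")
model.load_state_dict(ckpt["model_state"])
opt.load_state_dict(ckpt["optimizer_state"])
```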
Model Quality & Evaluation
When a model performs well on training data but poorly on new data—it memorized rather than learned generalizable patterns. Like code that passes all existing tests but fails on new inputs. Combat with more data, regularization, or simpler models.
When a model fails to capture patterns in the training data—it's too simple for the problem. Performance is poor on both training and new data. Fix by using a more complex model, adding features, or training longer.
Techniques to prevent overfitting by constraining the model. L2 regularization penalizes large weights. Dropout randomly disables neurons during training. Think of it as adding noise to prevent the model from relying too heavily on any single pattern.
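A short PyTorch sketch of both techniques: dropout as a layer and L2 regularization via the optimizer's `weight_decay`; sizes and rates are arbitrary:

```python
import torch
from torch import nn

# Dropout randomly zeroes 20% of activations during training.
model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(128, 10),
)

# weight_decay adds an L2 penalty on the weights to each optimizer update.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

model.train()   # dropout is active during training
model.eval()    # dropout is disabled for inference
```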
Scaling inputs or intermediate values to a standard range (often mean=0, variance=1). Batch normalization normalizes across a batch; layer normalization across features. Helps training stability and speed. Also refers to preprocessing inputs to similar scales.
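A minimal NumPy sketch of standardizing features to mean 0 and variance 1 (the values are made up):

```python
import numpy as np

X = np.array([[180.0, 70.0],    # toy features on very different scales
              [165.0, 55.0],    # (e.g. height in cm, weight in kg)
              [172.0, 80.0]])

# Standardize each column: subtract its mean, divide by its standard deviation.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
```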
Data held out from training to tune hyperparameters and check for overfitting during development. Not the same as test set—you'll look at validation metrics repeatedly, so there's some implicit data leakage. Test set is for final evaluation only.
Percentage of correct predictions. Simple but often misleading—a spam filter that never flags spam has 95% accuracy if 95% of email is legitimate. For imbalanced datasets, precision, recall, and F1 are usually more informative.
Precision: of things you flagged as positive, what fraction actually were? Recall: of actual positives, what fraction did you catch? There's usually a tradeoff—more aggressive flagging improves recall but hurts precision. F1 score is their harmonic mean.
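Continuing the spam-filter example, a small sketch computing these metrics from made-up confusion counts:

```python
# Toy confusion counts from a spam classifier (values invented for illustration).
tp, fp, fn = 40, 10, 20   # true positives, false positives, false negatives

precision = tp / (tp + fp)   # of flagged emails, how many were actually spam?
recall    = tp / (tp + fn)   # of actual spam, how much did we catch?
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```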
A standardized dataset and evaluation protocol for comparing models. MNIST for image classification, SQuAD for reading comprehension, MMLU for language model knowledge. Lets you compare your model against published baselines and track progress.
Learning Paradigms
Training with labeled examples—each input has a known correct output. Classification and regression are supervised. The "supervised" part is the labels telling the model what's right. Most production ML systems use supervised learning.
Learning patterns from unlabeled data. Clustering, dimensionality reduction, and anomaly detection are unsupervised. The model finds structure without being told what to look for. Often used for data exploration or as a preprocessing step.
Creating labels automatically from the data itself. LLMs use this: mask a word and predict it from context. No human labeling needed, enabling training on massive unlabeled datasets. The breakthrough that enabled modern foundation models.
Learning through trial and error with rewards/penalties. An agent takes actions in an environment and learns from feedback. Used for games, robotics, and RLHF in LLMs. Different from supervised learning—there's no "correct answer," just better or worse outcomes.
Using knowledge from one task to improve performance on another. A model trained on ImageNet can be adapted for medical imaging with less data. The foundation of modern ML—don't train from scratch, start from a pre-trained model.
Continuing training on a pre-trained model with new, task-specific data. Takes a general model and specializes it. Much faster and cheaper than training from scratch. The standard approach: take a foundation model, fine-tune on your data.
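A hedged PyTorch sketch of the usual recipe: freeze a pre-trained backbone and train a new task-specific head. `PretrainedBackbone` is a hypothetical stand-in, not a real library class:

```python
import torch
from torch import nn

class PretrainedBackbone(nn.Module):     # stand-in for any pre-trained model
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(512, 256), nn.ReLU())
    def forward(self, x):
        return self.encoder(x)

backbone = PretrainedBackbone()
# backbone.load_state_dict(torch.load("pretrained.pt"))  # in practice: load real weights

for p in backbone.parameters():          # freeze: keep the general knowledge intact
    p.requires_grad = False

head = nn.Linear(256, 2)                 # new task-specific classifier head
model = nn.Sequential(backbone, head)

# Only the head's parameters are updated on your (much smaller) task dataset.
opt = torch.optim.AdamW(head.parameters(), lr=1e-4)
```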
Initial training on a large, general dataset before task-specific fine-tuning. LLMs are pre-trained on internet text to learn language patterns, then fine-tuned for specific tasks. Pre-training is expensive (millions of dollars for large models) but only done once.
Neural Network Architecture
A model composed of layers of interconnected nodes (neurons) that transform inputs through learned weights and non-linear activation functions. "Deep learning" just means neural networks with many layers. Universal function approximators—with enough capacity they can, in theory, approximate any continuous function.
A group of neurons that process inputs and pass outputs to the next layer. Input layer receives data, hidden layers perform transformations, output layer produces predictions. More layers = more capacity to learn complex patterns, but also more parameters and compute.
Non-linear function applied after each layer's linear transformation. Without activations, stacking layers would just be one big linear function. ReLU (max(0, x)) is common; GELU is popular in transformers. Introduces the non-linearity needed to learn complex patterns.
The property that makes neural networks powerful. Linear functions can only learn straight-line relationships—stacking them just gives another linear function. Non-linear activations between layers let networks learn curves, boundaries, and complex patterns. Without non-linearity, a 100-layer network would be no more expressive than a single layer.
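A small NumPy demonstration of this point: two stacked linear layers collapse into a single linear map, but inserting a ReLU between them does not:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))    # first "layer"
W2 = rng.normal(size=(2, 4))    # second "layer"
x = rng.normal(size=3)

# Without an activation, stacking is equivalent to the single matrix W2 @ W1.
stacked = W2 @ (W1 @ x)
collapsed = (W2 @ W1) @ x
print(np.allclose(stacked, collapsed))      # True

# With a non-linearity in between, no single linear map reproduces the result.
relu = lambda z: np.maximum(0, z)
nonlinear = W2 @ relu(W1 @ x)
print(np.allclose(nonlinear, collapsed))    # False (in general)
```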
Architecture specialized for grid data like images. Uses convolution operations that slide learned filters across the input, detecting local patterns regardless of position. The standard architecture for computer vision. Key insight: local patterns (edges, textures) matter more than global pixel positions.
Architecture for sequential data that maintains hidden state across time steps. LSTM and GRU are variants that handle long-range dependencies better. Largely superseded by Transformers for most sequence tasks, but still used in some real-time applications.
The dominant architecture for modern AI. Processes sequences in parallel using attention mechanisms rather than recurrence. Enables massive scaling and parallelization. GPT, BERT, and virtually all modern LLMs are transformers. The "T" in GPT and ChatGPT.
Mechanism that lets the model dynamically focus on relevant parts of the input when producing each output. "Self-attention" relates different positions in the same sequence. Enables direct connections between any two tokens regardless of distance—solving the long-range dependency problem.
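A minimal NumPy sketch of scaled dot-product attention; it omits the learned query/key/value projections and multi-head machinery, and the inputs are random toy values:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query to every key
    weights = softmax(scores, axis=-1)   # how much each position attends to the others
    return weights @ V                   # weighted sum of the values

# Toy self-attention over a sequence of 5 tokens with dimension 8 (Q = K = V).
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
out = scaled_dot_product_attention(x, x, x)
```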
A learned dense vector representation of discrete items (words, users, products). Maps high-dimensional sparse data to low-dimensional continuous space where similar items are nearby. Word embeddings capture semantic relationships: king - man + woman ≈ queen.
Architecture pattern where an encoder processes input into a representation, and a decoder generates output from it. Used in translation, summarization. BERT is encoder-only (understanding), GPT is decoder-only (generation), T5 is encoder-decoder (both).
LLMs & Generative AI
Neural networks trained on massive text corpora to predict next tokens. "Large" typically means billions of parameters. Capabilities emerge from scale: reasoning, instruction-following, in-context learning. GPT-4, Claude, Llama are examples.
The atomic unit of text that models process. Not quite words—common words are single tokens, rare words split into subwords. "Tokenization" breaks text into tokens. Affects cost (APIs charge per token) and context limits. Roughly 1 token ≈ 4 characters or ¾ of a word in English.
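A small example, assuming the `tiktoken` package (OpenAI's open-source tokenizer library) is installed; exact splits vary by tokenizer:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Tokenization splits text into subword units.")
print(len(tokens))          # number of tokens, i.e. what an API would bill for
print(enc.decode(tokens))   # round-trips back to the original text
```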
Maximum number of tokens a model can process at once—its "working memory." Includes both input (your prompt) and output (the response). Ranges from 4K to 200K+ tokens depending on the model. Larger contexts enable longer documents but increase compute cost quadratically.
The input text you give to an LLM to elicit a response. Prompt engineering is the practice of crafting inputs that produce desired outputs. Unlike traditional programming, you're "programming" with natural language—the prompt is your interface to the model.
Special instructions that set the model's behavior, persona, or constraints for a conversation. Processed before user messages. Where you define "You are a helpful coding assistant" or "Always respond in JSON." Not all models support system prompts distinctly from user prompts.
Controls randomness in text generation. Temperature=0 is deterministic (always picks highest probability token). Higher values (0.7-1.0) increase creativity and variety but also errors. Lower values are better for factual tasks; higher for creative writing.
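A minimal NumPy sketch of how temperature reshapes the next-token distribution (temperature 0 is handled as a plain argmax in practice, since dividing by zero is undefined); the logits are made up:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    scaled = np.asarray(logits) / temperature
    scaled = scaled - scaled.max()      # numerical stability
    e = np.exp(scaled)
    return e / e.sum()

logits = [2.0, 1.0, 0.5]                          # toy next-token scores
print(softmax_with_temperature(logits, 0.1))      # nearly all mass on the top token
print(softmax_with_temperature(logits, 1.0))      # moderate spread
print(softmax_with_temperature(logits, 2.0))      # flatter: more randomness when sampling
```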
Methods to control which tokens the model considers when generating. Top-k: only consider the k most likely tokens. Top-p (nucleus): consider tokens until cumulative probability reaches p. Both prevent selecting very unlikely tokens while maintaining diversity.
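Minimal NumPy sketches of both filters over a made-up next-token distribution:

```python
import numpy as np

def top_k_filter(probs, k):
    """Zero out everything outside the k most likely tokens, then renormalize."""
    probs = np.asarray(probs, dtype=float).copy()
    cutoff = np.sort(probs)[-k]
    probs[probs < cutoff] = 0.0
    return probs / probs.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    probs = np.asarray(probs, dtype=float).copy()
    order = np.argsort(probs)[::-1]                 # tokens from most to least likely
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, p) + 1]
    mask = np.zeros_like(probs)
    mask[keep] = probs[keep]
    return mask / mask.sum()

probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
print(top_k_filter(probs, 2))     # only the two most likely tokens remain
print(top_p_filter(probs, 0.9))   # tokens kept until cumulative probability reaches 0.9
```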
Zero-shot: asking the model to perform a task with no examples. Few-shot: providing a few examples in the prompt before asking. LLMs can generalize from minimal examples—a key emergent capability. More examples usually improve performance but consume context.
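A tiny illustration of the difference, with a made-up sentiment task:

```python
zero_shot = (
    "Classify the sentiment of this review as positive or negative.\n"
    "Review: The battery died after two days.\nSentiment:"
)

few_shot = (
    "Review: Works perfectly, highly recommend.\nSentiment: positive\n"
    "Review: Broke on the first use.\nSentiment: negative\n"
    "Review: The battery died after two days.\nSentiment:"
)
```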
Training technique that uses human preferences to guide model behavior. Humans rank model outputs, a reward model learns these preferences, then RL optimizes the LLM against this reward. How models become helpful, harmless, and honest after pre-training.
When a model generates confident but false or fabricated information. LLMs predict plausible text, not truth—they'll invent citations, facts, or code that looks right but isn't. A fundamental limitation requiring validation, retrieval augmentation, or other mitigation.
Connecting model outputs to verifiable sources or real-world data. Reduces hallucination by anchoring responses to retrieved documents, databases, or APIs. RAG is a grounding technique. Grounded responses can cite sources; ungrounded ones cannot.
RAG & Retrieval
Architecture that retrieves relevant documents and includes them in the LLM's context before generating. Combines the knowledge retrieval of search with the synthesis ability of LLMs. The standard pattern for building LLM apps over your own data—cheaper and more accurate than fine-tuning for many use cases.
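A high-level sketch of the pattern; `embed`, `vector_store`, and `llm` are hypothetical stand-ins for whichever embedding model, vector database, and LLM client you use:

```python
def answer_with_rag(question, vector_store, embed, llm, k=5):
    # 1. Retrieve: find the stored chunks most similar to the question.
    query_vector = embed(question)
    chunks = vector_store.search(query_vector, top_k=k)

    # 2. Augment: put the retrieved text into the prompt as context.
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

    # 3. Generate: the LLM synthesizes an answer grounded in the retrieved text.
    return llm(prompt)
```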
Database optimized for storing and searching embedding vectors by similarity. Core infrastructure for RAG—stores document embeddings and efficiently finds the most relevant chunks for a query. Pinecone, Weaviate, Chroma, pgvector are examples.
Finding documents by meaning rather than keyword matching. Converts query and documents to embeddings, then finds nearest neighbors. "Running shoes" finds "athletic footwear" even without word overlap. The retrieval mechanism in RAG systems.
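A minimal NumPy sketch of nearest-neighbor search by cosine similarity; the vectors are random stand-ins for real embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy embeddings (in practice these come from an embedding model, not random numbers).
rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(1000, 384))   # 1000 documents, 384-dim embeddings
query_vector = rng.normal(size=384)

scores = np.array([cosine_similarity(query_vector, d) for d in doc_vectors])
top_5 = np.argsort(scores)[::-1][:5]         # indices of the 5 most similar documents
```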
Splitting documents into smaller pieces for embedding and retrieval. Chunk size affects retrieval quality—too small loses context, too large dilutes relevance. Common strategies: fixed size, sentence boundaries, recursive splitting, semantic chunking. A key RAG tuning parameter.
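A minimal sketch of the simplest strategy, fixed-size chunking with overlap; the sizes are arbitrary:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Fixed-size chunks with overlap so context isn't cut off at hard boundaries."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

chunks = chunk_text("some long document " * 200)
```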
A second-stage retrieval step that uses a more expensive model to reorder initial search results. First stage retrieves many candidates quickly (vector search), reranker scores each for relevance more accurately. Significantly improves RAG quality at modest compute cost.
Tools & Infrastructure
Libraries for building and training models. PyTorch dominates research and, increasingly, production. TensorFlow is common in enterprise. JAX is used at Google/DeepMind. They provide automatic differentiation, GPU acceleration, and neural network building blocks.
Graphics Processing Units repurposed for ML. Neural networks are mostly matrix operations—GPUs do these massively in parallel. NVIDIA dominates with CUDA. Training large models requires expensive GPU clusters; inference can often run on smaller hardware. The hardware constraint that shapes modern AI.
Reducing model precision from 32/16-bit floats to 8-bit or 4-bit integers. Dramatically reduces memory and speeds up inference with modest quality loss. How you run a 70B model on consumer hardware. GPTQ and AWQ are common quantization methods; GGUF (the successor to GGML) is a common file format for quantized models.
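A minimal NumPy sketch of symmetric int8 quantization, a simplified version of what the formats above do per layer or per group of weights:

```python
import numpy as np

weights = np.random.randn(4, 4).astype(np.float32)     # a toy float32 weight matrix

# Symmetric int8 quantization: map the float range onto [-127, 127].
scale = np.abs(weights).max() / 127.0
quantized = np.round(weights / scale).astype(np.int8)   # 4x smaller than float32

# Dequantize to approximate the original values at inference time.
restored = quantized.astype(np.float32) * scale
error = np.abs(weights - restored).max()                # small, but not zero
```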
Open Neural Network Exchange—a standard format for representing models across frameworks. Train in PyTorch, export to ONNX, deploy with ONNX Runtime or convert to other formats. Like a portable binary format for models.
Infrastructure for running inference at scale. Handles batching, load balancing, GPU scheduling, model versioning. vLLM, TGI, Triton, TensorRT-LLM are popular. The deployment layer between your trained model and production traffic.
Latency: time for one request. Throughput: requests per second. Often traded off—batching improves throughput but increases latency. For LLMs, also track time-to-first-token (responsiveness) and tokens-per-second (generation speed).
DevOps for ML—practices for deploying, monitoring, and maintaining models in production. Includes versioning data and models, experiment tracking, CI/CD for ML, monitoring for drift and degradation. Where software engineering meets ML engineering.
Practical Concerns
When production data distribution shifts from training data, degrading model performance. User behavior changes, new products launch, world events happen. Models don't automatically adapt—you need monitoring to detect drift and retraining to address it.
Systematic errors reflecting prejudices in training data or problem framing. A hiring model trained on historical decisions may perpetuate past discrimination. Requires careful dataset curation, evaluation across subgroups, and often explicit fairness constraints.
Understanding why a model made a specific prediction. Important for debugging, compliance, and trust. Simpler models (linear, decision trees) are inherently interpretable. For neural networks, techniques like SHAP, attention visualization, or probing try to extract explanations post-hoc.
Measuring model quality systematically. Beyond test set metrics: human evaluation, A/B testing, red teaming for safety. For LLMs, evaluation is especially challenging—automatic metrics often don't correlate with usefulness. "Evals" are a major focus of LLM development.
Attack where malicious input manipulates an LLM to ignore instructions or behave unexpectedly. Like SQL injection for AI. User input that says "ignore previous instructions and..." can override system prompts. A key security concern for LLM applications with untrusted input.
LLM-powered system that can take actions, use tools, and work autonomously toward goals. Goes beyond Q&A to executing multi-step tasks: browsing the web, writing and running code, calling APIs. ReAct, function calling, and tool use are common patterns.
LLM capability to output structured requests for external tools/APIs. Model decides when to call functions and with what arguments; your code executes them and returns results. Bridges LLMs to real-world actions: database queries, API calls, code execution.
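A hedged sketch of the general flow; the schema loosely follows the JSON-schema style several LLM APIs use, and `get_weather` and the model's reply are hypothetical:

```python
import json

# 1. Describe the tool to the model (sent along with the prompt).
tool_schema = {
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# 2. The model replies with a structured call instead of prose (hypothetical output).
model_output = '{"name": "get_weather", "arguments": {"city": "Berlin"}}'

# 3. Your code parses it, runs the real function, and returns the result to the model.
def get_weather(city):
    return {"city": city, "temperature_c": 21}   # stub implementation

call = json.loads(model_output)
if call["name"] == "get_weather":
    result = get_weather(**call["arguments"])
```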
Prompting technique where the model shows its reasoning steps before answering. Significantly improves performance on math, logic, and multi-step problems. "Let's think step by step" is the classic trigger. The model's intermediate reasoning helps it reach correct answers.