Object Detection (YOLO / Bounding Boxes)

Object detection is a computer vision task that finds and classifies every object in an image, drawing a labeled bounding box with a confidence score around each one. YOLO performs this as a single regression pass for real-time speed.

What is Object Detection?

Object detection is the computer vision task of locating and classifying multiple objects within a single image or video frame. Unlike image classification, which assigns one label to the whole image, detection answers two questions at once for every object present: what is it, and where is it. The output is a set of bounding boxes, each a rectangle defined by coordinates, paired with a class label (such as person, car, or dog) and a confidence score between 0 and 1.

YOLO, short for You Only Look Once, is the best-known family of real-time detectors. It reframes detection as a single regression problem: the network looks at the full image once and predicts box coordinates and class probabilities directly, rather than running a separate region-proposal stage. This single-pass design is what makes YOLO fast enough for video and edge deployment.

Classification asks what is in the image; detection asks what and where for every object.
Each detection is a bounding box plus a class label plus a confidence score.
YOLO treats detection as one regression pass over the whole image, enabling real-time inference.
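
To make the output format concrete, here is a minimal sketch of how one detection could be represented in code. The `Detection` class and its field names are illustrative assumptions, not the data structure of any particular library, and the sample values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected object: a box, a class label, and a confidence score."""
    x1: float  # left edge, in pixels
    y1: float  # top edge, in pixels
    x2: float  # right edge, in pixels
    y2: float  # bottom edge, in pixels
    label: str
    confidence: float  # expected to lie between 0 and 1

    def __post_init__(self):
        # Reject values that cannot describe a valid detection.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        if self.x2 < self.x1 or self.y2 < self.y1:
            raise ValueError("box corners are out of order")

    @property
    def area(self) -> float:
        return (self.x2 - self.x1) * (self.y2 - self.y1)


# A classifier would return a single label for the whole image;
# a detector returns a list like this, one entry per object (made-up values).
detections = [
    Detection(34, 50, 120, 310, "person", 0.91),
    Detection(200, 140, 420, 300, "car", 0.78),
]
```

The list answers both questions at once: `label` says what each object is, and the four coordinates say where it is.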

How YOLO Works: Grid, Anchors, and Confidence

The original YOLO divides the input image into an S x S grid. Each grid cell is responsible for detecting objects whose center falls inside it, and predicts a fixed number of bounding boxes along with confidence scores and class probabilities. A box prediction encodes center coordinates (x, y), width and height (w, h), and an objectness confidence indicating how likely the box contains an object and how well it fits.
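
The grid assignment described above can be sketched as follows. The function name `responsible_cell` and the default `S=7` are illustrative assumptions (7 x 7 is the grid size commonly associated with the original YOLO); the sketch only shows which cell owns an object, not how the network is trained.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Find the grid cell that contains a box center.

    Returns (row, col, x_off, y_off), where x_off and y_off give the
    center's position inside that cell as a fraction of the cell size.
    """
    # Express the center in grid units, e.g. 3.125 means cell 3, 12.5% of the way in.
    gx = cx * S / img_w
    gy = cy * S / img_h
    # Clamp so a center on the right or bottom image edge still maps to a valid cell.
    col = min(int(gx), S - 1)
    row = min(int(gy), S - 1)
    return row, col, gx - col, gy - row


# A box centered at (200, 100) in a 448 x 448 image with a 7 x 7 grid (64-pixel cells).
row, col, x_off, y_off = responsible_cell(200, 100, 448, 448)
```

In this example the center falls in row 1, column 3, so only that cell would be asked to predict the object.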

Many detectors, including several earlier YOLO versions, use anchor boxes: predefined box shapes of common aspect ratios. The network predicts offsets relative to these anchors rather than raw coordinates, which stabilizes training and lets a single cell detect multiple overlapping objects. The current Ultralytics release, YOLO26, is anchor-free and natively end-to-end: it produces predictions directly without a separate Non-Maximum Suppression stage, which lowers post-processing latency and simplifies deployment.

Image is split into an S x S grid; each cell predicts boxes, confidence, and class scores.
Anchor boxes are template shapes the model adjusts via predicted offsets.
YOLO26, the current Ultralytics release, is anchor-free and natively NMS-free end-to-end.
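
As a rough illustration of the anchor-offset idea, the sketch below decodes raw network outputs into a box using the parameterization popularized by YOLOv2 and YOLOv3: a sigmoid keeps the center inside its grid cell, and an exponential scales the anchor's width and height. The function name, argument names, and sample numbers are assumptions for illustration; other detectors and later YOLO versions use different decodings.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decode_anchor_box(offsets, cell_col, cell_row, anchor_w, anchor_h, stride):
    """Turn predicted offsets (tx, ty, tw, th) into a box in pixels.

    anchor_w and anchor_h are the template box size in pixels;
    stride is the width of one grid cell in pixels.
    Returns (center_x, center_y, width, height).
    """
    tx, ty, tw, th = offsets
    # Sigmoid bounds the center to lie within the responsible cell.
    cx = (cell_col + sigmoid(tx)) * stride
    cy = (cell_row + sigmoid(ty)) * stride
    # Exponential rescales the anchor; zero offsets return the anchor unchanged.
    w = anchor_w * math.exp(tw)
    h = anchor_h * math.exp(th)
    return cx, cy, w, h


# All-zero offsets: the box sits at the cell center with exactly the anchor's shape.
box = decode_anchor_box((0.0, 0.0, 0.0, 0.0), cell_col=3, cell_row=1,
                        anchor_w=100.0, anchor_h=50.0, stride=64)
```

Because the network only has to learn small corrections to a sensible template, training tends to be more stable than regressing raw coordinates.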

IoU, NMS, and Mean Average Precision

Intersection over Union (IoU) measures the overlap between two boxes by dividing the area of their intersection by the area of their union. It scores how well a predicted box matches a ground-truth box, and a threshold on it (commonly 0.5) decides whether a detection counts as correct.

A single object often triggers many overlapping predictions in classic detectors. Non-Maximum Suppression (NMS) cleans this up: it keeps the highest-confidence box, discards any other box whose IoU with it exceeds a threshold (commonly 0.5), and repeats on the boxes that remain. Detection accuracy is then summarized by mean Average Precision (mAP), which averages precision across recall levels and across classes. It is often reported as mAP@0.5 or as the stricter mAP@0.5:0.95, which averages over IoU thresholds from 0.5 to 0.95.
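
To show what "averaging precision across recall levels" can look like, here is a minimal sketch of Average Precision for one class, followed by the mean across classes. It assumes each detection has already been marked as a true or false positive at a fixed IoU threshold such as 0.5, and it uses one common variant (area under the monotone precision envelope). Evaluation toolkits differ in interpolation details, so treat this as an illustration rather than a reference implementation; the function names and input shapes are assumptions.

```python
def average_precision(scored_hits, num_ground_truth):
    """AP for one class.

    scored_hits: list of (confidence, is_true_positive) pairs.
    num_ground_truth: number of real objects of this class.
    """
    if num_ground_truth == 0:
        return 0.0
    # Walk through detections from most to least confident.
    ranked = sorted(scored_hits, key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Precision envelope: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the area under the precision-recall steps.
    ap, prev_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def mean_average_precision(per_class):
    """per_class maps a class name to (scored_hits, num_ground_truth)."""
    aps = [average_precision(hits, n) for hits, n in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0


# Made-up example: two classes with two ground-truth objects each.
example = {
    "car": ([(0.9, True), (0.8, False), (0.7, True)], 2),
    "dog": ([(0.9, True), (0.8, True)], 2),
}
car_ap = average_precision(*example["car"])
overall = mean_average_precision(example)
```

The false positive ranked second for "car" pulls that class below a perfect score, while "dog" has no mistakes and scores 1.0.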

IoU(A, B) = area(A ∩ B) / area(A ∪ B)

IoU is the area where predicted box A and ground-truth box B overlap, divided by the area of their union; values near 1 mean a tight match.
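
The formula translates almost directly into code. The sketch below assumes boxes are given as `(x1, y1, x2, y2)` corner tuples with `x1 <= x2` and `y1 <= y2`; the function name `iou` is illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle; zero when the boxes do not touch.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    # Union = both areas minus the overlap, which would otherwise be counted twice.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Two 10 x 10 boxes offset by 5 pixels share a 5 x 5 region: 25 / 175.
partial = iou((0, 0, 10, 10), (5, 5, 15, 15))
```

Subtracting the intersection from the summed areas is the step that is easiest to forget; without it the denominator is too large and every score comes out too low.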

IoU = intersection area divided by union area; 1.0 is a perfect overlap.
NMS removes duplicate boxes for the same object above an IoU threshold.
mAP is the standard accuracy metric, often reported at mAP@0.5 and mAP@0.5:0.95.
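
A minimal sketch of greedy NMS in plain Python is shown below. It is class-agnostic for brevity, whereas practical pipelines usually apply the same procedure separately per class, and production code would use a vectorized library routine. The function names and the sample boxes are illustrative assumptions.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy Non-Maximum Suppression; returns indices of the boxes to keep."""
    # Candidate indices, most confident first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Two near-duplicate boxes on one object plus one separate object (made-up values).
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Here the second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the distant third box survives.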

Running YOLO in Practice

The Ultralytics library is the most common way to run modern YOLO. A pretrained model loads in a single line and returns detections as boxes with confidence and class indices. The example below loads a small pretrained YOLO26 checkpoint, runs inference on an image, and prints each detected class with its confidence.

```python
from ultralytics import YOLO

# Load a small pretrained detection model
model = YOLO("yolo26n.pt")

# Run inference on an image
results = model.predict(source="street.jpg", conf=0.25)

for r in results:
    for box in r.boxes:
        cls_id = int(box.cls[0])
        label = model.names[cls_id]
        confidence = float(box.conf[0])
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        print(f"{label}: {confidence:.2f} at [{x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}]")
```

Loading a pretrained YOLO26 model and printing detected objects with Ultralytics.

Pretrained checkpoints come in n, s, m, l, and x sizes: smaller models run faster, larger ones are more accurate.
Results expose boxes, confidence scores, and class IDs you can iterate over.
The same API also covers segmentation, pose, and classification tasks.
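
The example prints boxes in corner format (`x1, y1, x2, y2`), while the grid discussion earlier describes boxes by center, width, and height. Converting between the two is simple arithmetic; the helpers below are a hypothetical sketch with illustrative names, not part of the Ultralytics API.

```python
def xyxy_to_xywh(x1, y1, x2, y2):
    """Corner format -> (center_x, center_y, width, height)."""
    w = x2 - x1
    h = y2 - y1
    return x1 + w / 2, y1 + h / 2, w, h


def xywh_to_xyxy(cx, cy, w, h):
    """Center format -> (x1, y1, x2, y2) corners."""
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


# A box spanning x from 10 to 50 and y from 20 to 100.
center_box = xyxy_to_xywh(10, 20, 50, 100)
round_trip = xywh_to_xyxy(*center_box)
```

Corner format is convenient for drawing and for computing overlap, while center format matches how grid-based detectors describe their predictions.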

Where Object Detection Is Used

Object detection underpins applications where knowing the location of things matters, not just their presence. Common deployments include autonomous driving (pedestrians, vehicles, signs), retail and inventory counting, medical imaging, agricultural monitoring, security and surveillance, and document processing where layout regions must be found before text is read.

The real-time, single-pass nature of YOLO makes it a frequent choice for edge devices and live video, where latency budgets are tight and a server round trip is not always available.

Driving, retail counting, medical imaging, agriculture, and security are core use cases.
Real-time speed makes YOLO popular for edge and on-device deployment.
Detection often feeds downstream steps like tracking or OCR.

Key takeaways

Object detection outputs a labeled bounding box and confidence score for every object in an image.
YOLO frames detection as a single regression pass over the full image, enabling real-time speed.
IoU measures box overlap, NMS removes duplicate boxes, and mAP summarizes overall accuracy.
The Ultralytics API runs pretrained YOLO models in a few lines and returns boxes, classes, and confidences.
YOLO26, the current Ultralytics release, is anchor-free and natively end-to-end without a separate NMS step.

Frequently asked questions

What does YOLO stand for?

YOLO stands for You Only Look Once. The name reflects its core idea: the network processes the entire image in a single forward pass to predict all bounding boxes and class probabilities at once, rather than scanning regions repeatedly.

How is object detection different from image classification?

Image classification assigns one label to a whole image. Object detection finds and classifies multiple objects within the image and reports where each one is using bounding boxes, so it answers both what and where.

What is IoU?

Intersection over Union (IoU) measures how well a predicted box overlaps a ground-truth box. It is the intersection area divided by the union area. A higher IoU means a tighter match, and a threshold such as 0.5 decides whether a detection counts as correct.

What is Non-Maximum Suppression (NMS)?

NMS is a post-processing step that removes duplicate detections of the same object. It keeps the highest-confidence box and discards overlapping boxes whose IoU exceeds a threshold, leaving one clean box per object. Newer detectors such as YOLO26 are end-to-end and skip this step.

What is the latest YOLO version?

As of 2026, YOLO26 is the current Ultralytics release. It is anchor-free and natively NMS-free end-to-end, generating predictions directly for lower latency, and is optimized for edge and CPU deployment. For maximum accuracy without strict latency limits, transformer-based detectors are also competitive.
