Object detection is a computer vision task that finds and classifies every object in an image, drawing a labeled bounding box with a confidence score around each one. YOLO performs this as a single regression pass for real-time speed.
What is Object Detection?
Object detection is the computer vision task of locating and classifying multiple objects within a single image or video frame. Unlike image classification, which assigns one label to the whole image, detection answers two questions at once for every object present: what is it, and where is it. The output is a set of bounding boxes, each a rectangle defined by coordinates, paired with a class label (such as person, car, or dog) and a confidence score between 0 and 1.
YOLO, short for You Only Look Once, is the best-known family of real-time detectors. It reframes detection as a single regression problem: the network looks at the full image once and predicts box coordinates and class probabilities directly, rather than running a separate region-proposal stage. This single-pass design is what makes YOLO fast enough for video and edge deployment.
- Classification asks what is in the image; detection asks what and where for every object.
- Each detection is a bounding box plus a class label plus a confidence score.
- YOLO treats detection as one regression pass over the whole image, enabling real-time inference.
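As a rough illustration of the output format described above, the sketch below models one detection as a small record. The `Detection` class, its field names, and the `filter_by_confidence` helper are hypothetical names chosen for this example; real libraries return their own result types. Filtering on a minimum confidence is one common way to use the score, and the 0.25 default is an arbitrary illustrative value.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected object: a box, a class label, and a confidence score."""
    x1: float  # left edge, in pixels
    y1: float  # top edge, in pixels
    x2: float  # right edge, in pixels
    y2: float  # bottom edge, in pixels
    label: str  # class name such as "person", "car", or "dog"
    confidence: float  # score between 0 and 1


def filter_by_confidence(detections, min_confidence=0.25):
    """Keep only detections whose score is at least min_confidence."""
    return [d for d in detections if d.confidence >= min_confidence]
```

A classifier would return a single label for the whole image; a detector returns a list of these records, one per object it finds.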
How YOLO Works: Grid, Anchors, and Confidence
The original YOLO divides the input image into an S x S grid. Each grid cell is responsible for detecting objects whose center falls inside it, and predicts a fixed number of bounding boxes along with confidence scores and class probabilities. A box prediction encodes center coordinates (x, y), width and height (w, h), and an objectness confidence indicating how likely the box contains an object and how well it fits.
Many detectors, including several earlier YOLO versions, use anchor boxes: predefined box shapes of common aspect ratios. The network predicts offsets relative to these anchors rather than raw coordinates, which stabilizes training and lets a single cell detect multiple overlapping objects. The current Ultralytics release, YOLO26, is anchor-free and natively end-to-end: it produces predictions directly without a separate Non-Maximum Suppression stage, which lowers post-processing latency and simplifies deployment.
- Image is split into an S x S grid; each cell predicts boxes, confidence, and class scores.
- Anchor boxes are template shapes the model adjusts via predicted offsets.
- YOLO26, the current Ultralytics release, is anchor-free and natively NMS-free end-to-end.
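The grid assignment and the anchor-offset idea can be sketched in a few lines. This is a minimal illustration, not any release's actual code: the function names are hypothetical, `S=7` matches the grid size used in the original YOLO paper, the decoding formula follows the style of YOLOv2 and YOLOv3, and anchor sizes are assumed to be expressed as fractions of the image size.

```python
import math


def assign_grid_cell(cx, cy, img_w, img_h, S=7):
    """Find the grid cell that owns a box center given in pixels.

    Returns (row, col, x_offset, y_offset), where the offsets give the
    position of the center inside the cell, in cell units.
    """
    gx = cx / img_w * S  # center x in grid units
    gy = cy / img_h * S  # center y in grid units
    col = min(int(gx), S - 1)  # clamp centers that sit on the far edge
    row = min(int(gy), S - 1)
    return row, col, gx - col, gy - row


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_anchor(tx, ty, tw, th, row, col, anchor_w, anchor_h, S=7):
    """Turn raw network outputs into a box, relative to one anchor.

    The sigmoid keeps the predicted center inside the owning cell; the
    exponential scales the anchor's width and height. All returned
    values are fractions of the image size.
    """
    bx = (sigmoid(tx) + col) / S
    by = (sigmoid(ty) + row) / S
    bw = anchor_w * math.exp(tw)
    bh = anchor_h * math.exp(th)
    return bx, by, bw, bh
```

With all four raw outputs at zero, the decoded box sits at the center of its cell and has exactly the anchor's shape, which is why predicting offsets gives training a sensible starting point.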
IoU, NMS, and Mean Average Precision
Intersection over Union (IoU) measures overlap between two boxes by dividing the area of their intersection by the area of their union. It scores how well a predicted box matches a ground-truth box, and a threshold on IoU decides whether a detection counts as correct.
A single object often triggers many overlapping predictions in classic detectors. Non-Maximum Suppression (NMS) cleans this up: it keeps the highest-confidence box and discards any other box that overlaps it above an IoU threshold (commonly 0.5). Detection accuracy is then summarized by mean Average Precision (mAP), which averages precision across recall levels and across classes, often reported at IoU thresholds such as mAP@0.5 or the stricter mAP@0.5:0.95.
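IoU and greedy NMS as described above can be written in plain Python. This is a minimal sketch: the function names are illustrative, boxes are assumed to be `(x1, y1, x2, y2)` corner coordinates, and suppression is applied to all boxes together, whereas practical pipelines usually run it separately for each class and use vectorized implementations.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height are clamped at zero for boxes that do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of the boxes to keep, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # highest-scoring box still in play
        keep.append(best)
        # Drop every remaining box that overlaps the kept one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

For example, two boxes of 10 x 10 pixels offset by one pixel in each direction have an IoU of 81 / 119, about 0.68, so at a threshold of 0.5 the lower-scoring one is suppressed.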
- IoU = intersection area divided by union area; 1.0 is a perfect overlap.
- NMS removes duplicate boxes for the same object above an IoU threshold.
- mAP is the standard accuracy metric, often reported at mAP@0.5 and mAP@0.5:0.95.
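To make "averages precision across recall levels and across classes" concrete, the sketch below computes Average Precision for one class and then the mean over classes. It assumes each detection has already been marked as a true or false positive at a chosen IoU threshold, and it uses all-point interpolation of the precision-recall curve. Evaluation toolkits differ in their interpolation details, so treat this as one variant; the function names and input layout are hypothetical.

```python
def average_precision(detections, num_ground_truth):
    """AP for one class.

    detections: list of (confidence, is_true_positive) pairs.
    num_ground_truth: number of labeled objects of this class.
    """
    if num_ground_truth == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Precision envelope: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class):
    """per_class maps class name -> (detections, num_ground_truth)."""
    aps = [average_precision(d, n) for d, n in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0
```

Under this sketch, mAP@0.5 is the result when true positives are decided at an IoU threshold of 0.5, and mAP@0.5:0.95 averages the same quantity over thresholds from 0.5 to 0.95.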
Running YOLO in Practice
The Ultralytics library is a common way to run modern YOLO models. A pretrained model loads in a single line and returns detections as boxes with confidence scores and class indices. A typical script loads a small pretrained YOLO26 checkpoint, runs inference on an image, and prints each detected class with its confidence.
- Pretrained checkpoints come in n, s, m, l, and x sizes; smaller ones run faster, larger ones are more accurate.
- Results expose boxes, confidence scores, and class IDs you can iterate over.
- The same API also covers segmentation, pose, and classification tasks.
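The workflow described in this section might look like the sketch below. It assumes the `ultralytics` package is installed and that the small YOLO26 checkpoint is published as `yolo26n.pt`; that filename follows the naming pattern of earlier releases and is an assumption here, as are the helper names `format_detections` and `detect` and the image path in the example call.

```python
def format_detections(names, class_ids, confidences):
    """Pair each class ID with its name and confidence as a readable line."""
    return [
        f"{names[int(c)]}: {float(p):.2f}"
        for c, p in zip(class_ids, confidences)
    ]


def detect(image_path, weights="yolo26n.pt"):
    """Run a pretrained model on one image and describe what it found."""
    # Imported here so the helper above works without the package installed.
    from ultralytics import YOLO

    model = YOLO(weights)  # assumed checkpoint name; fetched on first use
    result = model(image_path)[0]  # one Results object per input image
    boxes = result.boxes  # box coordinates, class IDs, and confidences
    return format_detections(result.names, boxes.cls.tolist(), boxes.conf.tolist())


# Example call (hypothetical image path):
# for line in detect("street.jpg"):
#     print(line)
```

Box coordinates are available on the same object, for instance through `boxes.xyxy`, if the positions are needed as well as the labels.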
Where Object Detection Is Used
Object detection underpins applications where knowing the location of things matters, not just their presence. Common deployments include autonomous driving (pedestrians, vehicles, signs), retail and inventory counting, medical imaging, agricultural monitoring, security and surveillance, and document processing where layout regions must be found before text is read.
The real-time, single-pass nature of YOLO makes it a frequent choice for edge devices and live video, where latency budgets are tight and a server round trip is not always available.
- Driving, retail counting, medical imaging, agriculture, and security are core use cases.
- Real-time speed makes YOLO popular for edge and on-device deployment.
- Detection often feeds downstream steps like tracking or OCR.
Key takeaways
- Object detection outputs a labeled bounding box and confidence score for every object in an image.
- YOLO frames detection as a single regression pass over the full image, enabling real-time speed.
- IoU measures box overlap, NMS removes duplicate boxes, and mAP summarizes overall accuracy.
- The Ultralytics API runs pretrained YOLO models in a few lines and returns boxes, classes, and confidences.
- YOLO26, the current Ultralytics release, is anchor-free and natively end-to-end without a separate NMS step.