AI Explained

Why HNSW Vector Search Is Fast

Aditya Kumar JhaAditya Kumar JhaLinkedIn·June 7, 2026·12 min read

HNSW vector search is fast because it is approximate. Here is how the graph trades a little recall for huge speed at scale.

Vector search stays fast because HNSW, the approximate nearest neighbor (ANN) index behind most modern vector databases, never compares your query against every stored vector. Instead it navigates a small graph to a tiny, well-chosen subset and returns matches that are close enough, trading a sliver of accuracy for an enormous jump in speed across millions of vectors. That is how a search can answer in a few milliseconds: by refusing to look at almost all of them.

That tradeoff is the whole trick. Exact search guarantees the true nearest neighbors but pays for it by touching the entire dataset. Approximate search gives up the guarantee and, in exchange, answers in milliseconds across millions or billions of vectors. The HNSW graph is the structure that makes the approximation both fast and accurate enough to trust in production.

Insight

The counterintuitive part: a search that returns nine of the ten true matches in two milliseconds beats one that returns all ten in a full second. Almost nobody needs the perfect tenth result.

Insight

This post assumes you already know what a vector embedding is. If not, read the MemX explainer on vector embeddings without math first. Here the focus is strictly the index that powers similarity search and semantic search at scale, and how that search stays fast, not what the numbers inside a vector mean.

Exact vs approximate nearest neighbor search: what is the difference?

Exact nearest neighbor search, also called brute-force or flat search, compares the query against every single vector in the database and ranks them by distance. It always finds the true closest matches. The cost is linear: double the data, double the work. For a few thousand vectors that is fine. For millions, with each vector holding hundreds or thousands of dimensions, scanning everything on every query becomes too slow and too expensive.

Approximate nearest neighbor (ANN) search drops the guarantee on purpose. As Elastic frames it, ANN sacrifices perfect accuracy in exchange for executing efficiently in high dimensional embedding spaces, at scale, whereas exact methods like kNN lead to excessive execution times and heavy resource use.

The reason this works is that most applications do not need the mathematically perfect top result. They need results that are clearly relevant. A search that returns most of the true nearest neighbors, in a few milliseconds, beats a search that returns every one but takes a full second. ANN turns search from an exhaustive scan into a quick, guided walk.

What is recall@k in vector search?

Every ANN index sits on a dial between speed and recall. Recall is the measure of how many true neighbors the approximate search actually found. Crank the dial toward speed and you visit fewer candidates and miss more. Crank it toward accuracy and you visit more and slow down. The operating point is something you choose, not something you get for free.

Recall@k is the standard quality metric. It is the fraction of the true top-k nearest neighbors that the approximate search returned in its own top-k. Formally, recall@k equals the size of the intersection between the approximate result set and the ground-truth set, divided by k. A recall@10 of 0.9 means the index found nine of the ten genuinely closest vectors. To know that number you need ground truth, which means running an exact search once on a sample to compare against.

The practical workflow is to pick a recall target, often recall@10 or recall@100, then tune the index until it hits that target at the lowest possible latency. A recommendation feed might happily run at 0.9 recall for speed. A legal or medical retrieval system might push for 0.99 and accept slower queries. The point is that recall is a knob you set deliberately, not an accident you discover later.

Pro Tip

Measure recall on your own data and query distribution, not on a public benchmark. The same index settings can produce very different recall depending on how your vectors cluster, so always sample real queries and compare against an exact-search ground truth before shipping.

What is HNSW (Hierarchical Navigable Small World)?

HNSW stands for Hierarchical Navigable Small World graph, and it is the index behind most modern vector search. It was introduced by Yu. A. Malkov and D. A. Yashunin in their paper introducing Hierarchical Navigable Small World graphs, which describes a multi-layer structure of proximity graphs that achieves logarithmic-scaling search complexity.

In the graph, each vector is a node, and edges connect it to its nearby vectors. Search becomes navigation: start somewhere, and keep hopping to whichever connected neighbor is closer to the query, until no closer neighbor exists. That greedy walk works, but on a single flat graph it can stall, wandering dense local regions before it reaches the right neighborhood. HNSW fixes that with layers.

The layered, skip-list-like structure

HNSW stacks several graph layers. The authors note the algorithm's similarity to the skip list data structure, where sparse upper layers let you take long jumps and the dense bottom layer lets you do fine-grained search. The top layers contain only a few nodes and act like a rough map of the space. The bottom layer contains every node and gives the detailed view.

The authors select the maximum layer in which an element is present randomly with an exponentially decaying probability distribution, so most nodes live only on the bottom layer and progressively fewer reach the higher ones. This is exactly the skip-list trick applied to a graph: a thin express network on top for covering distance fast, a complete local network on the bottom for precision.

Greedy traversal from top to bottom

Search enters at a single node in the topmost, sparsest layer and greedily walks toward the query, repeatedly moving to the locally closest connected node. Wikipedia describes this as greedy navigation, where the algorithm repeatedly chooses locally better nodes to approach the query point. When it can get no closer in that layer, it drops down one layer and continues from where it landed.

Insight

Here is the part that should feel a little magical: to find the closest vector among a billion, HNSW looks at only a few thousand of them, and still finds it almost every time.

Think of it like taking the highway before the side streets: the upper layers cover the long distances in a handful of big hops, so by the time the walk reaches the bottom layer it is already in the right region and only needs to refine. That layered descent is what gives HNSW its logarithmic-scaling behavior instead of the linear cost of scanning everything. Each query therefore touches a tiny slice of the data instead of all of it. That is the structural reason a graph index like HNSW usually beats a cluster index like IVF on the speed-recall curve: it refines continuously toward the query instead of committing up front to a fixed set of clusters.

What do M, efConstruction, and efSearch do in HNSW?

HNSW exposes three parameters that control the speed, recall, memory, and build-time tradeoffs. Two are set when you build the index and one is set per query.

  • M: the maximum number of connections each node keeps per layer. Higher M builds a richer graph with better recall and faster convergence, but uses more memory and takes longer to build. In pgvector this defaults to 16.
  • efConstruction: the size of the candidate list used while building the graph. Higher values explore more neighbors during insertion, improving the graph's quality and final recall, at the cost of slower index build. In pgvector this defaults to 64.
  • efSearch: the size of the candidate list used at query time. Higher efSearch visits more nodes per search, raising recall but increasing latency. This is the live speed-versus-recall dial you turn without rebuilding the index.

pgvector documents M as the max number of connections per layer and efConstruction as the size of the dynamic candidate list for constructing the graph, noting that a higher efConstruction provides better recall at the cost of index build time and insert speed.

M and efConstruction shape the graph once, trading build time and memory for a better foundation. efSearch then chooses how hard to search that fixed graph on each query. Tune the first two for the index you can afford to store and build, then move efSearch until recall and latency land where you want them.

HNSW vs IVF vs Flat

Three index families cover most use cases. Flat is exact brute-force. IVF (inverted file) partitions vectors into clusters and searches only the clusters nearest the query. HNSW is the layered graph. Each makes a different bargain across speed, recall, memory, and build time.

PropertyHNSW (graph)IVF (clusters)Flat (brute-force)
Search speedVery fast, logarithmic scalingFast, scans only nearby clustersSlowest, scans every vector
Recall / accuracyHigh, tunable via efSearchGood, tunable via nprobePerfect by definition
Memory useHigh, stores graph edgesLow, stores cluster listsLowest, just the vectors
Build timeSlow, builds the graphFaster, plus a training stepNone, no index to build
Best whenLarge data, low latency, high recallLarge data, memory-constrainedSmall data or recall must be exact
Pro Tip

Quick decision rule: under roughly ten thousand vectors, use Flat, since the exact answer is cheap and recall is perfect. From thousands to billions where latency and recall both matter, use HNSW. Choose IVF only when memory or fast rebuilds are the binding constraint, since it trades query performance for a lighter, faster-to-build index. In short, HNSW by default at scale, Flat when small, IVF when memory-bound.

pgvector summarizes the practical contrast: HNSW has better query performance than IVFFlat in the speed-recall tradeoff and needs no training step, while IVFFlat has faster build times and uses less memory but lower query performance.

FAISS offers all three through index types such as IndexFlatL2 for exact search, IndexIVFFlat for cluster-based search, and IndexHNSWFlat for the graph, exposing efConstruction and efSearch directly. The FAISS index documentation states the fundamental rule plainly: a typical way to speed up the process at the cost of losing the guarantee to find the nearest neighbor is to employ a partitioning technique. Every ANN family makes that same bargain in its own way.

Where is HNSW used? (pgvector, FAISS, managed vector databases)

HNSW already ships in the tools teams use every day. pgvector adds HNSW and IVFFlat indexes to PostgreSQL, so a normal relational database becomes a vector store. FAISS, the open-source library from Meta, implements flat, IVF, and HNSW indexes for large-scale similarity search. Managed vector databases and search engines layer HNSW under a hosted API so you never touch the graph directly. This is the same index that powers retrieval-augmented generation (RAG), where an LLM retrieves relevant documents by similarity search before answering.

Because the underlying algorithm is the same, the parameter intuition transfers. Whether the knob is called M and efSearch in pgvector and FAISS, or hidden behind a tier setting in a managed service, you are always choosing the same thing: how much of the graph to explore, and therefore how much recall to buy with how much latency.

The catch most HNSW explainers skip

Here is what most explainers leave out: HNSW is fast only while its graph lives in RAM. The graph stores many edges per node, so the index is large, and traversal jumps around memory at random. The ParadeDB writeup on pgvector limitations puts it bluntly: for acceptable query latency the index needs to live in shared_buffers or the OS page cache, and once it spills to disk, p99 latency climbs sharply. Size your memory for the index, not just the raw vectors.

Deletes hide a second trap. An HNSW graph does not reclaim space when you remove rows. Deleted nodes sit as tombstones in the graph until you rebuild the index, so a workload that re-embeds and churns constantly bloats over time and slowly loses recall. That is why teams with high-churn data schedule periodic rebuilds instead of treating the index as set-and-forget.

How a consumer AI memory app uses vector search

The same approximate tradeoff shows up in a consumer app you can actually hold. MemX is an AI second brain: you dump in photos, PDFs, scanned documents, voice notes, and WhatsApp messages, and it reads and indexes them on your behalf. When you later ask a plain-English question, MemX does not scan every capture you have ever saved. It runs an ANN similarity search over your indexed content to fetch the few that matter, then answers and cites the original source so you can check it. That is the same bargain running through this post: give up a sliver of exhaustive recall to get an answer back in a moment instead of a minute. MemX is on Android, in iOS TestFlight beta, and on WhatsApp today, with web coming soon.

Frequently Asked Questions
01Is HNSW accurate?

No. Exact (flat) search is perfectly accurate by definition because it compares every vector. HNSW is approximate and can miss true neighbors. Its advantage is speed at scale, and it can reach very high recall, often above 0.95, with the right efSearch setting.

02What is efSearch in HNSW?

efSearch sets how many candidate nodes the graph visits during a query. Raising it makes the search explore more of the graph, increasing recall but adding latency. It is the main runtime dial for the speed-versus-recall tradeoff, and you can change it without rebuilding the index.

03When should I use IVF instead of HNSW?

Choose IVF when memory is tight or build time matters, since it uses less memory and builds faster. Choose HNSW when you want the best speed-recall tradeoff and can afford the extra memory. For small datasets, exact flat search is simplest and exact.

04Is higher recall always better?

Not necessarily. Recall measures how many true nearest neighbors you found, not whether users notice the difference. Many applications feel identical at 0.9 and 0.99 recall while the lower setting is much faster, so tune recall to user-visible quality, not to a perfect score.

05Why is vector search called approximate?

Because it deliberately skips most of the data. Instead of comparing the query to every vector, ANN indexes like HNSW navigate a graph or probe a few clusters and return matches that are close enough. Dropping the exactness guarantee is exactly what makes search fast at scale.

Read Next

Or try MemX to access 40+ AI models in one place — including Claude Sonnet 4.6 and GPT-5.4 — and get your questions answered today.

Was this article helpful?

Found this useful? Share it with someone who needs it.

Free · iOS, Android & WhatsApp

Stop losing what you save.
Let MemX remember it for you.

Every screenshot, photo, PDF and voice note — captured, encrypted, and instantly searchable. Ask in plain English, get the answer in seconds.

  • Reads text inside images and handwriting
  • Private and encrypted by default
  • Free to start, no credit card

Takes under a minute to set up. Your data stays yours.

Aditya Kumar Jha
Written by
Aditya Kumar JhaLinkedIn

Core software engineer at MemX, where he builds the website, backend, and data systems. Also a published author of six books on Amazon KDP, writing on AI, memory, and behavior.

Keep reading

More guides for AI-powered students.