Vector Similarity Metrics (Cosine vs Dot Product vs Euclidean)

By Arpit Tripathi, Founder

Vector similarity metrics measure how close two embeddings are. Cosine similarity compares direction (angle), dot product combines direction and magnitude, and Euclidean distance measures straight-line distance. The right choice depends on whether embeddings are normalized.

What are Vector Similarity Metrics?

Vector similarity metrics are mathematical functions that quantify how similar two vectors, typically embeddings, are to each other. In semantic search and retrieval-augmented generation, text, images, and other data are converted into high-dimensional embeddings, and a similarity metric decides which stored vectors are closest to a query vector. The choice of metric determines what closeness means and directly affects retrieval quality.

The three most common metrics are cosine similarity, dot product (inner product), and Euclidean distance. They capture different geometric notions: angle, angle combined with magnitude, and straight-line distance. For normalized vectors they rank results identically, but on raw vectors they can behave very differently.

  • Similarity metrics score how close two embeddings are in vector space.
  • They power semantic search, RAG retrieval, and recommendation ranking.
  • Cosine, dot product, and Euclidean are the three standard choices.
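
To make the difference on raw vectors concrete, here is a minimal sketch with three made-up 2-D candidates and one query. The vectors and the helper name `best_match` are assumptions chosen for illustration, not output from any real embedding model.

```python
import numpy as np

# Hypothetical 2-D vectors chosen for illustration; real embeddings have
# hundreds or thousands of dimensions.
query = np.array([1.0, 0.0])
candidates = {
    "a": np.array([3.0, 3.0]),  # long vector, 45 degrees away from the query
    "b": np.array([0.5, 0.0]),  # same direction as the query, but shorter
    "c": np.array([1.0, 0.2]),  # almost the same point as the query
}

def dot(u, v):
    return float(np.dot(u, v))

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean(u, v):
    return float(np.linalg.norm(u - v))

def best_match(score_fn, higher_is_better):
    """Name of the candidate that the given metric ranks first."""
    scores = {name: score_fn(query, vec) for name, vec in candidates.items()}
    pick = max if higher_is_better else min
    return pick(scores, key=scores.get)

best_by_dot = best_match(dot, higher_is_better=True)
best_by_cosine = best_match(cosine, higher_is_better=True)
best_by_euclidean = best_match(euclidean, higher_is_better=False)

print(best_by_dot, best_by_cosine, best_by_euclidean)
```

With these particular vectors each metric picks a different winner: dot product rewards the long vector "a", cosine rewards the perfectly aligned "b", and Euclidean distance rewards the nearby point "c".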

Cosine Similarity

Cosine similarity measures the cosine of the angle between two vectors, ignoring their magnitudes. It ranges from -1 (opposite direction) through 0 (orthogonal) to 1 (same direction). Because it focuses only on direction, it is the most common default for text embeddings, where what matters is semantic orientation rather than vector length, which can vary with document length or token count.

Cosine similarity is computed as the dot product of the two vectors divided by the product of their magnitudes. This division is exactly what removes magnitude from the comparison. One caveat worth knowing: research by Steck et al. shows that on some learned embeddings, cosine similarity can be implicitly shaped by regularization and yield arbitrary results, so it is a sensible default rather than a guaranteed-best choice.

cosine(A, B) = (A · B) / (‖A‖ ‖B‖) = Σ AᵢBᵢ / (√Σ Aᵢ² · √Σ Bᵢ²)
Cosine similarity is the dot product normalized by both vector lengths, so only the angle between A and B affects the result.
  • Compares direction (angle), not magnitude.
  • Ranges from -1 to 1; higher means more similar.
  • A common default for text embeddings, though not universally optimal.
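
The formula translates almost directly into NumPy. The sketch below is one way to write it; the function name `cosine_similarity` and the zero-vector check are assumptions added for illustration. It also checks the property described above: rescaling one vector leaves the score unchanged.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0.0:
        # The angle to a zero vector is undefined, so fail loudly.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / norm_product)

a = np.array([0.2, 0.5, 0.1, 0.8])
b = np.array([0.1, 0.4, 0.3, 0.7])

original = cosine_similarity(a, b)
rescaled = cosine_similarity(10.0 * a, b)   # same direction, 10x the length
opposite = cosine_similarity(a, -a)         # exactly opposite direction
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

print(f"original: {original:.4f}, rescaled: {rescaled:.4f}")
print(f"opposite: {opposite:.4f}, orthogonal: {orthogonal:.4f}")
```

The original and rescaled scores agree up to floating-point rounding, while the opposite and orthogonal pairs land at the two reference points of the range, -1 and 0.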

Dot Product and Euclidean Distance

Dot product (inner product) multiplies vectors element-wise and sums the result. It reflects both the angle between vectors and their magnitudes, so longer vectors can score higher. Many embedding models are trained so that the dot product is the intended similarity, and it is cheaper to compute than cosine because it skips the normalization step. On vectors that are already unit-normalized, dot product and cosine similarity are mathematically equivalent.

Euclidean distance (the L2 distance) measures the straight-line distance between the endpoints of two vectors. Unlike the others it is a distance, so smaller means more similar, with 0 meaning identical. It is sensitive to magnitude and is common when absolute position in the space carries meaning, such as some image embeddings or clustering tasks.

A · B = Σ AᵢBᵢ
d(A, B) = √Σ (Aᵢ − Bᵢ)²
Top: dot product sums the element-wise products. Bottom: Euclidean distance is the square root of the summed squared differences; smaller distance means closer vectors.
  • Dot product captures direction and magnitude; larger means more similar.
  • Euclidean distance is a straight-line distance; smaller means more similar.
  • On unit-normalized vectors, dot product equals cosine similarity and ranks the same as Euclidean.
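
The equivalence on unit vectors follows from expanding the squared distance: ‖A − B‖² = ‖A‖² + ‖B‖² − 2(A · B), which becomes 2 − 2(A · B) when both lengths are 1. The sketch below checks this numerically on random data; the seed, the array sizes, and the helper name `normalize` are arbitrary assumptions for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(50, 8))  # 50 random stand-ins for stored embeddings
query = rng.normal(size=8)

def normalize(x):
    """Scale each vector (along the last axis) to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

unit_vectors = normalize(vectors)
unit_query = normalize(query)

# Dot product on the unit vectors, and cosine computed on the raw vectors.
dot_scores = unit_vectors @ unit_query
cosine_scores = (vectors @ query) / (
    np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
)
# Squared Euclidean distance on the unit vectors.
sq_distances = np.sum((unit_vectors - unit_query) ** 2, axis=1)

dot_equals_cosine = np.allclose(dot_scores, cosine_scores)
identity_holds = np.allclose(sq_distances, 2.0 - 2.0 * dot_scores)
# Descending similarity should match ascending distance.
same_ranking = np.array_equal(np.argsort(-dot_scores), np.argsort(sq_distances))

print(dot_equals_cosine, identity_holds, same_ranking)
```

Because the distance is a decreasing function of the dot product here, sorting by highest dot product and sorting by lowest distance return the same order.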

Choosing and Computing a Metric

The practical rule: if your embeddings are normalized to unit length, cosine, dot product, and Euclidean produce the same ranking, so pick the cheapest your index supports (often dot product). If vectors are not normalized and magnitude is noise, use cosine. If magnitude carries real meaning, dot product or Euclidean may be appropriate. Always match the metric to the one the embedding model was trained with, which the model card usually states.

Most vector databases let you select the metric at index creation, and the common approximate-nearest-neighbor indexes generally support all three, though the available options vary by engine and index type. The snippet below computes the three metrics on the same pair of vectors with NumPy.

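```python
import numpy as np

a = np.array([0.2, 0.5, 0.1, 0.8])
b = np.array([0.1, 0.4, 0.3, 0.7])

# Dot product: sum of element-wise products (higher means more similar).
dot = np.dot(a, b)
# Cosine similarity: dot product divided by the product of the vector lengths.
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))
# Euclidean distance: length of the difference vector (lower means more similar).
euclidean = np.linalg.norm(a - b)

print(f"dot product:        {dot:.4f}")
print(f"cosine similarity:  {cosine:.4f}")
print(f"euclidean distance: {euclidean:.4f}")
```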
Computing cosine similarity, dot product, and Euclidean distance with NumPy.
  • Normalized vectors: all three rank identically, so choose the cheapest.
  • Use cosine when magnitude is noise; use dot product or Euclidean when magnitude matters.
  • Match the metric to what the embedding model was trained on.
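
As a sketch of the first rule in practice, the example below normalizes a small corpus once and then ranks it with a plain dot product, which on unit vectors equals cosine similarity. It is a brute-force search over a made-up corpus, not an approximate-nearest-neighbor index, and the function name `top_k` is an assumption for illustration.

```python
import numpy as np

def top_k(query, corpus, k=3):
    """Return (indices, scores) of the k corpus rows most similar to query.

    Both sides are unit-normalized first, so the dot product below
    equals cosine similarity.
    """
    corpus_unit = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = corpus_unit @ query_unit      # one dot product per corpus row
    order = np.argsort(-scores)[:k]        # highest scores first
    return order, scores[order]

# Hypothetical 4-D "embeddings"; row 2 is the query scaled by 2.
corpus = np.array([
    [0.1, 0.4, 0.3, 0.7],
    [0.9, 0.1, 0.0, 0.1],
    [0.4, 1.0, 0.2, 1.6],
    [0.0, 0.0, 1.0, 0.0],
])
query = np.array([0.2, 0.5, 0.1, 0.8])

indices, scores = top_k(query, corpus, k=2)
print(indices, scores)
```

Row 2 ranks first even though it is twice as long as the query, because normalization removes the length difference. In a real system the corpus would typically be normalized once at indexing time rather than on every query.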

Key takeaways

  • Cosine similarity compares direction (angle) and ignores magnitude; it is a common default for text embeddings.
  • Dot product combines direction and magnitude and is cheaper to compute than cosine.
  • Euclidean distance is a straight-line distance where smaller values mean more similar.
  • On unit-normalized vectors, all three metrics produce the same ranking.
  • Always match the metric to the one the embedding model was trained with.

Frequently asked questions

What is the difference between cosine similarity and dot product?
Cosine similarity measures only the angle between two vectors, ignoring their lengths. Dot product accounts for both angle and magnitude, so longer vectors can score higher. On unit-normalized vectors the two are equivalent, since cosine is just the dot product divided by the lengths.

Which similarity metric should I use?
Use the metric the embedding model was trained with, which is usually stated on its model card. Cosine similarity is a safe default for text. If your vectors are unit-normalized, cosine, dot product, and Euclidean rank identically, so you can pick the fastest your index supports.

Does a higher score always mean more similar?
For cosine similarity and dot product, higher means more similar. For Euclidean distance, lower means more similar, with 0 meaning the vectors are identical. Distances and similarities run in opposite directions, so check which one your vector database returns.

Why is cosine similarity a common default for text embeddings?
Text embedding magnitudes can vary with document length and other factors that do not reflect meaning. Cosine similarity ignores magnitude and compares only direction, so it captures semantic orientation cleanly. That said, it is not infallible; research by Steck et al. shows it can behave unexpectedly on some learned embeddings, so it is a strong default rather than a guarantee.

Is cosine similarity the same as Euclidean distance?
Not in general, but they are closely related. For unit-normalized vectors, ranking by cosine similarity gives the same order as ranking by Euclidean distance, because squared Euclidean distance equals 2 minus 2 times cosine similarity. On raw vectors they can differ.