AI Foundations

Rotary Position Embedding (RoPE)

Rotary Position Embedding (RoPE) encodes a token's position by rotating its query and key vectors by an angle proportional to its position, so the attention dot product depends only on the relative distance between tokens. It is the default position encoding in modern LLMs such as LLaMA, Mistral, Qwen, and DeepSeek.

What is Rotary Position Embedding (RoPE)?

Rotary Position Embedding (RoPE) is a method for injecting positional information into a transformer's self-attention by rotating each query and key vector by an angle that grows with the token's position in the sequence. Because attention scores come from the dot product of a query and a key, and rotation preserves dot products in a way that depends only on the difference of the two rotation angles, the resulting score depends on the relative distance between the two tokens rather than their absolute indices.

RoPE was introduced in the 2021 paper RoFormer by Jianlin Su and colleagues. Unlike learned absolute position embeddings, which add a position vector to the token embedding, RoPE multiplies the query and key by a position-dependent rotation matrix inside each attention head. This keeps the norm of the vectors unchanged and folds relative position directly into the attention computation.

Acts on queries and keys, not on the input embeddings or the value vectors.
Encodes absolute position via rotation but yields attention that is a function of relative position.
Adds no learned parameters: the rotation angles are fixed by a frequency schedule.

How RoPE works mathematically

RoPE splits each query or key vector into pairs of coordinates and treats each pair as a point in a 2D plane. Coordinate pair i is rotated by an angle m·theta_i, where m is the token position and theta_i is a fixed per-pair frequency. Low-index pairs rotate quickly and capture short-range position, while high-index pairs rotate slowly and capture long-range position.

The relative-position property follows from a basic fact of rotations: applying rotation by angle a to a query and rotation by angle b to a key makes their inner product depend on a minus b. With a at position m and b at position n, the attention score depends on m minus n, the relative distance.

R(m)·x : for each pair (x_{2i}, x_{2i+1}), [x'_{2i}, x'_{2i+1}] = [x_{2i}·cos(m·theta_i) - x_{2i+1}·sin(m·theta_i),  x_{2i}·sin(m·theta_i) + x_{2i+1}·cos(m·theta_i)]

Each consecutive pair of coordinates in the query or key vector is rotated by the angle m·theta_i, where m is the token position and theta_i is the fixed frequency for that pair.

(R(m)q) · (R(n)k) = q^T R(n - m) k

Rotating the query by position m and the key by position n makes their dot product depend only on the relative offset n minus m, which is the core relative-position property of RoPE.

theta_i follows a geometric schedule, typically theta_i = base^(-2i/d) with base = 10000 and head dimension d.
The construction gives a natural decay of attention with increasing relative distance.
Long-context variants rescale the frequencies; NTK-aware scaling and YaRN extend the usable context window.

Implementing RoPE

A common and efficient implementation precomputes the cosine and sine of every angle, then applies the rotation using a rotate-half trick instead of an explicit matrix multiply. The snippet below shows the canonical PyTorch pattern used in LLaMA-style models.

python

import torch

def build_rope_cache(seq_len, head_dim, base=10000.0):
    # frequencies for each coordinate pair
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    pos = torch.arange(seq_len).float()
    angles = torch.outer(pos, inv_freq)        # (seq_len, head_dim/2)
    emb = torch.cat([angles, angles], dim=-1)  # (seq_len, head_dim)
    return emb.cos(), emb.sin()

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rope(q, k, cos, sin):
    # cos, sin broadcast over (batch, heads, seq, head_dim)
    cos = cos[None, None, :, :]
    sin = sin[None, None, :, :]
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot

Minimal RoPE applied to query and key tensors of shape (batch, heads, seq, head_dim).

Why modern LLMs use RoPE

RoPE became the default position encoding for open-weight LLMs because it combines several practical advantages. It introduces no extra parameters, integrates relative position without a separate bias term, and extrapolates to longer sequences better than fixed absolute embeddings when paired with frequency-scaling techniques.

Models including LLaMA, Mistral, Qwen, Gemma, and DeepSeek all use RoPE. The method is also a building block for newer attention variants: DeepSeek's Multi-Head Latent Attention keeps a decoupled RoPE component precisely because rotary embeddings cannot be folded into a compressed key-value latent without special handling.

Parameter-free and easy to add to any attention head.
Supports context-length extension via NTK-aware scaling and YaRN.
Compatible with efficient attention kernels such as FlashAttention.

Key takeaways

RoPE rotates query and key vectors by a position-dependent angle so attention scores depend on relative distance.
It adds no learned parameters and preserves vector norms, unlike additive absolute position embeddings.
Frequency-scaling methods like NTK-aware scaling and YaRN extend RoPE to longer contexts than seen in training.
RoPE is the default position encoding in LLaMA, Mistral, Qwen, Gemma, and DeepSeek models.

Frequently asked questions

It is a way to tell a transformer where each token sits in a sequence by rotating the token's query and key vectors. The rotation angle grows with position, so when two tokens are compared their attention score reflects how far apart they are.

Absolute embeddings add a learned position vector to each token. RoPE instead multiplies queries and keys by a fixed rotation, adds no parameters, preserves vector length, and makes attention depend on relative distance rather than absolute index.

RoPE encodes position through rotation frequencies that can be rescaled after training. Techniques such as NTK-aware interpolation and YaRN adjust these frequencies so a model handles sequences longer than it saw during pretraining without full retraining.

RoPE is used by most modern open-weight LLMs, including LLaMA, Mistral, Qwen, Gemma, and DeepSeek. It is also the position-encoding basis inside newer attention designs such as Multi-Head Latent Attention.

Very little. The rotation is applied once to queries and keys per layer and is usually implemented with precomputed cosine and sine tables plus a rotate-half operation, so it adds no learned parameters and negligible overhead.

Put the idea into practice

MemX is an AI memory agent built on these ideas: store anything, skip the folders, and find it again by asking in plain English.

Try MemX Free