Rotary Position Embedding (RoPE) encodes a token's position by rotating its query and key vectors by an angle proportional to its position, so the attention dot product depends only on the relative distance between tokens. It is the default position encoding in modern LLMs such as LLaMA, Mistral, Qwen, and DeepSeek.
What is Rotary Position Embedding (RoPE)?
Rotary Position Embedding (RoPE) is a method for injecting positional information into a transformer's self-attention by rotating each query and key vector by an angle that grows with the token's position in the sequence. Attention scores come from the dot product of a query and a key, and the dot product of two rotated vectors depends on the rotation angles only through their difference. The resulting score therefore depends on the relative distance between the two tokens rather than on their absolute indices.
RoPE was introduced in the 2021 paper RoFormer by Jianlin Su and colleagues. Unlike learned absolute position embeddings, which add a position vector to the token embedding, RoPE multiplies the query and key by a position-dependent rotation matrix inside each attention head. This keeps the norm of the vectors unchanged and folds relative position directly into the attention computation.
- Acts on queries and keys, not on the input embeddings or the value vectors.
- Encodes absolute position via rotation but yields attention that is a function of relative position.
- Adds no learned parameters: the rotation angles are fixed by a frequency schedule.
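The properties above can be checked on a single 2D coordinate pair. The following is a minimal sketch, not a production implementation: the frequency `THETA`, the helper names, and the example vectors are hypothetical values chosen only for illustration.

```python
import math

THETA = 0.1  # hypothetical rotation frequency for this one pair (illustrative)

def rotate_2d(vec, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

def score(q, k, m, n, theta=THETA):
    """Attention logit between a query at position m and a key at position n."""
    return dot(rotate_2d(q, m * theta), rotate_2d(k, n * theta))

q = (1.0, 2.0)   # example query pair (made up)
k = (0.5, -1.5)  # example key pair (made up)

# Same relative offset (m - n = 3) at very different absolute positions.
s_near = score(q, k, 5, 2)
s_far = score(q, k, 105, 102)

# A different offset (m - n = 4) gives a different score.
s_other = score(q, k, 5, 1)

# Rotation leaves the vector length unchanged.
norm_before = math.hypot(*q)
norm_after = math.hypot(*rotate_2d(q, 7 * THETA))
```

Here `s_near` and `s_far` agree up to floating-point error even though the absolute positions differ by 100, while `s_other` does not, and the norm of `q` is the same before and after rotation.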
How RoPE works mathematically
RoPE splits each query or key vector into pairs of coordinates and treats each pair as a point in a 2D plane. Coordinate pair i is rotated by an angle m·theta_i, where m is the token position and theta_i is a fixed per-pair frequency. Low-index pairs rotate quickly and capture short-range position, while high-index pairs rotate slowly and capture long-range position.
The relative-position property follows from a basic fact about rotations: if a query is rotated by angle a and a key by angle b, their inner product depends on the two angles only through the difference a minus b. For coordinate pair i, a = m·theta_i for a query at position m and b = n·theta_i for a key at position n, so the attention score depends on m minus n, the relative distance.
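For a single coordinate pair, the argument above can be written out as a short derivation. The notation is an assumption introduced here for illustration: R(α) is the 2D rotation matrix, and q and k are the two-dimensional slices of the query and key that belong to pair i.

```latex
\[
R(\alpha) =
\begin{pmatrix}
\cos\alpha & -\sin\alpha \\
\sin\alpha & \cos\alpha
\end{pmatrix},
\qquad
R(\alpha)^{\top} = R(-\alpha),
\qquad
R(\alpha)\,R(\beta) = R(\alpha + \beta)
\]

\[
\big\langle R(m\theta_i)\,q,\; R(n\theta_i)\,k \big\rangle
= q^{\top} R(m\theta_i)^{\top} R(n\theta_i)\,k
= q^{\top} R\big(-(m - n)\,\theta_i\big)\,k
\]
```

The positions m and n appear in the final expression only through m - n. Summing this over all coordinate pairs gives the full attention score, which inherits the same property.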
- theta_i follows a geometric schedule, typically theta_i = base^(-2i/d) for i = 0, ..., d/2 - 1, with base = 10000 and head dimension d.
- With this schedule, the RoFormer paper shows that an upper bound on the attention score tends to decay as relative distance grows.
- Long-context variants rescale the frequencies; NTK-aware scaling and YaRN extend the usable context window.
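The geometric schedule is easy to compute directly. The sketch below assumes the formula theta_i = base^(-2i/d) given above; the function names and the small head dimension of 8 are illustrative choices, not taken from any particular library.

```python
import math

def rope_frequencies(head_dim, base=10000.0):
    """Return theta_i = base^(-2i/d) for i = 0 .. d/2 - 1."""
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even so coordinates can be paired")
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def periods(thetas):
    """Number of positions each pair needs to complete one full turn."""
    return [2.0 * math.pi / t for t in thetas]

thetas = rope_frequencies(head_dim=8)   # tiny head dimension, for readability
turns = periods(thetas)

for i, (t, p) in enumerate(zip(thetas, turns)):
    print(f"pair {i}: theta = {t:.4f}, full turn every {p:.1f} positions")
```

With d = 8 and base = 10000 the frequencies are approximately 1, 0.1, 0.01, and 0.001, so pair 0 completes a turn in about 6 positions while pair 3 needs several thousand. This is the sense in which low-index pairs track short-range position and high-index pairs track long-range position.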
Implementing RoPE
A common and efficient implementation precomputes the cosine and sine of every angle, then applies the rotation with a rotate-half trick instead of an explicit matrix multiply. This pattern is widely used in PyTorch implementations of LLaMA-style models.
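The rotate-half pattern can be sketched without any dependencies. What follows is a plain-Python stand-in that operates on lists, not the PyTorch code of any specific model: the function names, the example vectors, and the convention that coordinate i is paired with coordinate i + d/2 are assumptions made for illustration.

```python
import math

def rope_frequencies(head_dim, base=10000.0):
    """theta_i = base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def build_cos_sin(seq_len, head_dim, base=10000.0):
    """Precompute cos and sin tables, each of shape [seq_len][head_dim]."""
    thetas = rope_frequencies(head_dim, base)
    cos_table, sin_table = [], []
    for pos in range(seq_len):
        angles = [pos * t for t in thetas]
        angles = angles + angles  # frequency i serves coordinates i and i + d/2
        cos_table.append([math.cos(a) for a in angles])
        sin_table.append([math.sin(a) for a in angles])
    return cos_table, sin_table

def rotate_half(x):
    """Map (x1, x2) to (-x2, x1), where x1 and x2 are the two halves of x."""
    half = len(x) // 2
    return [-v for v in x[half:]] + list(x[:half])

def apply_rope(x, pos, cos_table, sin_table):
    """Rotate one query or key vector for position `pos`: x*cos + rotate_half(x)*sin."""
    rotated = rotate_half(x)
    return [xi * c + ri * s
            for xi, ri, c, s in zip(x, rotated, cos_table[pos], sin_table[pos])]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Usage: made-up query and key vectors with head dimension 8.
cos_t, sin_t = build_cos_sin(seq_len=64, head_dim=8)
q = [0.3, -1.2, 0.7, 2.0, -0.5, 1.1, 0.0, -0.9]
k = [1.0, 0.4, -0.6, 0.2, 0.8, -1.3, 0.5, 0.1]

# Same offset (3) at different absolute positions.
score_a = dot(apply_rope(q, 5, cos_t, sin_t), apply_rope(k, 2, cos_t, sin_t))
score_b = dot(apply_rope(q, 45, cos_t, sin_t), apply_rope(k, 42, cos_t, sin_t))
```

The element-wise form `x * cos + rotate_half(x) * sin` applies a 2D rotation to every pair at once, which is why no rotation matrix is ever materialized. In a tensor library the same expression is applied to whole batches of queries and keys via broadcasting.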
Why modern LLMs use RoPE
RoPE became the default position encoding for open-weight LLMs because it combines several practical advantages. It introduces no extra parameters, integrates relative position without a separate bias term, and extrapolates to longer sequences better than fixed absolute embeddings when paired with frequency-scaling techniques.
Models including LLaMA, Mistral, Qwen, Gemma, and DeepSeek all use RoPE. The method is also a building block for newer attention variants: DeepSeek's Multi-Head Latent Attention keeps a decoupled RoPE component precisely because rotary embeddings cannot be folded into a compressed key-value latent without special handling.
- Parameter-free and easy to add to any attention head.
- Supports context-length extension via NTK-aware scaling and YaRN.
- Compatible with efficient attention kernels such as FlashAttention.
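To make the context-extension point concrete, here is a sketch of one commonly cited form of NTK-aware scaling, which enlarges the base by a factor of scale^(d/(d-2)). The article does not give a formula, so treat this formulation, the function names, and the scale factor of 4 as assumptions; YaRN is more involved and is not sketched here.

```python
def rope_frequencies(head_dim, base=10000.0):
    """theta_i = base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def ntk_scaled_base(base, head_dim, scale):
    """Assumed NTK-aware rule: base' = base * scale^(d / (d - 2))."""
    return base * scale ** (head_dim / (head_dim - 2))

head_dim = 64
scale = 4.0  # hypothetical target: roughly 4x the training context length

original = rope_frequencies(head_dim)
scaled = rope_frequencies(head_dim, ntk_scaled_base(10000.0, head_dim, scale))

# Fastest pair is untouched; slowest pair is slowed down by exactly `scale`.
fastest_ratio = scaled[0] / original[0]
slowest_ratio = original[-1] / scaled[-1]
```

Under this rule the fastest-rotating pair keeps its frequency, so short-range resolution is preserved, while the slowest pair is stretched by the full scale factor. The pairs in between are rescaled by intermediate amounts.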
Key takeaways
- RoPE rotates query and key vectors by a position-dependent angle so attention scores depend on relative distance.
- It adds no learned parameters and preserves vector norms, unlike additive absolute position embeddings.
- Frequency-scaling methods like NTK-aware scaling and YaRN extend RoPE to longer contexts than seen in training.
- RoPE is the default position encoding in LLaMA, Mistral, Qwen, Gemma, and DeepSeek models.