Word2Vec (CBOW & Skip-gram)

By Arpit Tripathi, Founder

Word2Vec learns dense vector representations of words from raw text using a shallow neural network. Its two architectures, CBOW and skip-gram, predict words from context or context from words, placing similar words near each other in vector space.

What is Word2Vec?

Word2Vec is a method for learning dense, low-dimensional vector representations (embeddings) of words from a large unlabeled text corpus. Proposed by Tomas Mikolov and colleagues at Google in 2013, it uses a shallow two-layer neural network trained on a prediction task, and the learned weights become the word vectors.

The guiding idea is the distributional hypothesis: words that appear in similar contexts tend to have similar meanings. Word2Vec turns this into geometry. Words used in similar contexts end up with vectors that are close together, and many semantic and syntactic relationships show up as consistent directions in the vector space.

  • Learns embeddings from unlabeled text via a self-supervised prediction task.
  • Rests on the distributional hypothesis: similar contexts imply similar meaning.
  • The network's learned weights are the final word vectors.

CBOW vs Skip-gram

Word2Vec comes in two architectures. Continuous Bag-of-Words (CBOW) predicts a target word from the average of the vectors of its surrounding context words. Skip-gram does the reverse: it takes a single center word and predicts each of its surrounding context words.

CBOW is faster to train and tends to work well on frequent words, since it aggregates context. Skip-gram is slower but usually produces better representations for rare words and smaller corpora, because each context word becomes a separate training signal for the center word.
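
To make the difference concrete, here is a minimal sketch of how one tokenized sentence could be turned into training examples for each architecture. The function names `skipgram_pairs` and `cbow_examples` are hypothetical helpers written for illustration, not part of any library, and real implementations add details this sketch leaves out.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs: one pair per context word."""
    pairs = []
    for t, center in enumerate(tokens):
        lo = max(0, t - window)
        hi = min(len(tokens), t + window + 1)
        for j in range(lo, hi):
            if j != t:  # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window):
    """Return (context_words, target) examples: one example per position."""
    examples = []
    for t, target in enumerate(tokens):
        lo = max(0, t - window)
        hi = min(len(tokens), t + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != t]
        examples.append((context, target))
    return examples


sentence = ["the", "cat", "sat", "on", "the", "mat"]
sg = skipgram_pairs(sentence, window=1)
cb = cbow_examples(sentence, window=1)

print(sg[:3])  # first few (center, context) pairs
print(cb[:2])  # first few (context list, target) examples
```

With the same window, skip-gram yields more training examples than CBOW for this sentence, because every context word is a separate pair. This is the "separate training signal" the paragraph above refers to.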

Skip-gram objective: (1/T) Σ_{t=1}^{T} Σ_{−m ≤ j ≤ m, j≠0} log p(w_{t+j} | w_t)
Skip-gram maximizes the average log-probability of context words w_{t+j} within a window of size m around each center word w_t over T tokens.
p(w_O | w_I) = exp(v'_{w_O} · v_{w_I}) / Σ_{w=1}^{V} exp(v'_w · v_{w_I})
The softmax probability of an output word given an input word, using input vectors v and output vectors v' over a vocabulary of size V.
  • CBOW: context words in, center word out; faster, smooths over frequent words.
  • Skip-gram: center word in, context words out; better for rare words and small data.
  • Both use a sliding context window whose size is a tunable hyperparameter.
  • The trained input weight matrix is the final word embedding table.
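
The softmax formula above can be sketched in plain Python as follows. The vectors are made-up toy values, and `softmax_probs` is a hypothetical helper name. The point is only to show that the denominator needs one dot product per vocabulary word.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax_probs(v_in, output_vecs):
    """p(w | w_I) for every word w, given the input vector v_in of w_I."""
    scores = [dot(v_out, v_in) for v_out in output_vecs]  # one score per vocabulary word
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)  # the sum over the whole vocabulary
    return [e / total for e in exps]


# Toy values (assumptions for illustration): a 3-word vocabulary, 2-d vectors.
v_in = [1.0, 0.0]
output_vecs = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

probs = softmax_probs(v_in, output_vecs)
print(probs)
```

Here the loop over `output_vecs` has three entries. With a realistic vocabulary it would have hundreds of thousands, and it would run for every training pair.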

Negative sampling and efficiency

Computing the full softmax over a large vocabulary at every step is expensive, since the denominator sums over every word. Word2Vec makes training practical with two tricks: hierarchical softmax, which replaces the flat softmax with a binary tree over the vocabulary and cuts the cost per prediction from O(V) to roughly O(log V), and negative sampling, which is the more popular choice.

Negative sampling reframes the task as binary classification. For each real (center, context) pair, the model also draws a few random words as negatives and learns to tell the true context word apart from the sampled noise words. This turns one expensive softmax into a handful of cheap logistic updates. Word2Vec also subsamples very frequent words like "the" and "of", which improves both speed and the quality of representations for rarer, more informative words.
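
The frequent-word subsampling mentioned above can be sketched as follows. The formula is an assumption brought in from the original Word2Vec paper, not something this article states: each occurrence of a word with relative frequency f is discarded with probability 1 − sqrt(t / f), with a threshold t around 1e-5. Library implementations may use a slightly different variant, and the frequencies below are invented for illustration.

```python
import math


def discard_prob(freq, t=1e-5):
    """Probability of dropping one occurrence of a word with relative frequency `freq`."""
    return max(0.0, 1.0 - math.sqrt(t / freq))


# Hypothetical relative frequencies (fraction of all tokens in the corpus).
freqs = {"the": 0.05, "of": 0.03, "embedding": 1e-6}

for word, f in freqs.items():
    print(word, round(discard_prob(f), 3))
```

Under this formula, very frequent words are dropped most of the time. Words rarer than the threshold are never dropped.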

log σ(v'_{w_O} · v_{w_I}) + Σ_{i=1}^{k} E_{w_i ∼ P_n(w)} [ log σ(−v'_{w_i} · v_{w_I}) ]
The negative-sampling objective for one pair: push the true context word's score up and k sampled negatives' scores down, where σ is the sigmoid and P_n is the noise distribution.
  • Full softmax over the vocabulary is too costly for large corpora.
  • Negative sampling trains against a few random noise words per positive pair.
  • Frequent-word subsampling reduces the dominance of uninformative tokens.
  • Typical embedding dimensions range from about 100 to 300.
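
As a rough sketch, the negative-sampling objective for one pair can be computed as below. The vectors are toy values. The noise distribution uses unigram counts raised to the 3/4 power, which is the choice reported in the original paper rather than something stated in this article. The helper names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def neg_sampling_objective(v_in, v_pos, v_negs):
    """log sigma(v'_pos . v_in) + sum over negatives of log sigma(-v'_neg . v_in)."""
    value = math.log(sigmoid(dot(v_pos, v_in)))
    for v_neg in v_negs:
        value += math.log(sigmoid(-dot(v_neg, v_in)))
    return value


def noise_distribution(counts, power=0.75):
    """Unigram counts raised to `power`, normalized (assumed form of P_n)."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}


# Hypothetical corpus counts and a draw of k = 5 noise words from P_n.
counts = {"the": 100, "cat": 10, "mat": 1}
p_n = noise_distribution(counts)
rng = random.Random(0)
negatives = rng.choices(list(p_n), weights=list(p_n.values()), k=5)

# Toy 2-d vectors: the objective is higher when the true context vector
# aligns with the center vector than when it points the opposite way.
v_in = [1.0, 0.5]
v_negs = [[-0.5, 0.1], [0.2, -0.9]]
good = neg_sampling_objective(v_in, [0.8, 0.4], v_negs)
bad = neg_sampling_objective(v_in, [-0.8, -0.4], v_negs)
print(good, bad)
```

Each call costs k + 1 dot products instead of one per vocabulary word. The 3/4 power flattens the distribution, so rare words are drawn as negatives somewhat more often than their raw frequency would suggest.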

Vector arithmetic and analogies

A striking property of Word2Vec embeddings is that relationships between words appear as consistent vector offsets. The classic example is that the vector for king minus man plus woman lands near the vector for queen. These analogies emerge because the training objective encodes co-occurrence regularities as linear structure.

This makes the embeddings useful as features for downstream tasks such as text classification, named entity recognition, and semantic similarity, and as a conceptual foundation for later contextual embeddings from transformer models.

python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    # ... a real corpus would have millions of tokenized sentences
]

model = Word2Vec(
    sentences,
    vector_size=100,   # embedding dimension
    window=5,          # context window size
    sg=1,              # 1 = skip-gram, 0 = CBOW
    negative=5,        # number of negative samples
    min_count=1,
    workers=4,
)

vec = model.wv["cat"]                       # the 100-d embedding for 'cat'
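print(model.wv.most_similar("cat", topn=3)) # nearest neighbors by cosine

# Analogy query; needs a corpus whose vocabulary contains all three words:
# model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)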
Train Word2Vec with gensim and query similar words and analogies.
  • Cosine similarity between vectors measures semantic relatedness.
  • Analogy queries return the nearest word to a combination of vectors.
  • Static embeddings: one fixed vector per word, regardless of sentence context.
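
The cosine and analogy computations in the bullets above can be illustrated without a trained model. The 2-d vectors below are hand-made, not learned, and are deliberately constructed so that the offset holds exactly. With real embeddings the match is only approximate. `analogy` is a hypothetical helper.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors: axis 0 loosely "royalty", axis 1 loosely "gender".
vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.05],
}


def analogy(a, b, c, vectors):
    """Word nearest (by cosine) to vec(a) - vec(b) + vec(c), excluding the query words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("king", "man", "woman", vectors))  # queen
```

Excluding the query words from the candidates is a common convention. Without it, the nearest neighbor of the combined vector is often one of the input words themselves.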

Limitations and what came after

Word2Vec produces static embeddings: each word has exactly one vector regardless of how it is used, so it cannot distinguish the river bank from a savings bank. It also has no built-in handling for words unseen during training (out-of-vocabulary), a gap that subword methods like fastText later addressed.

These limits motivated contextual embeddings, where a model produces a different vector for a word depending on its sentence. ELMo, then BERT and other transformer models, generalized the embedding idea to be context-sensitive. Word2Vec remains important as a fast, interpretable baseline and as the conceptual origin of modern dense embeddings used in semantic search and vector databases.

  • One static vector per word: cannot disambiguate polysemy by context.
  • No native out-of-vocabulary handling; fastText added subword units.
  • Contextual models (BERT and successors) extended embeddings to be context-aware.
  • Still a strong, fast baseline and the ancestor of modern embedding systems.
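
To show what "subword units" means for the out-of-vocabulary problem, here is a minimal sketch of character n-gram extraction in the style fastText describes, with `<` and `>` marking word boundaries. `char_ngrams` is a hypothetical helper, and the sketch omits details of the real library such as n-gram hashing.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """All character n-grams of the boundary-marked word, for n in [n_min, n_max]."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


trigrams = char_ngrams("where", n_min=3, n_max=3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']

# Two related words share n-grams, so a word unseen in training can still
# get a vector built from n-grams that were seen in other words.
shared = set(char_ngrams("banking", 3, 3)) & set(char_ngrams("banker", 3, 3))
print(sorted(shared))
```

A word vector is then composed from the vectors of its n-grams. Word2Vec has no equivalent, because it stores one vector per whole word.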

Key takeaways

  • Word2Vec learns dense word vectors from unlabeled text using a shallow neural network.
  • CBOW predicts a word from its context; skip-gram predicts context from a word.
  • Negative sampling and frequent-word subsampling make training over large vocabularies efficient.
  • Semantic relationships appear as vector offsets, enabling analogy arithmetic: king minus man plus woman lands near queen.
  • Embeddings are static (one vector per word), which contextual models like BERT later improved upon.

Frequently asked questions

What is the difference between CBOW and skip-gram?
CBOW predicts a target word from its surrounding context and trains faster, performing well on frequent words. Skip-gram predicts the context words from a single center word, is slower, and produces better embeddings for rare words and smaller corpora.

What is negative sampling?
Negative sampling replaces the expensive full softmax with binary classification. For each true word-context pair, a few random noise words are sampled, and the model learns to separate the real context word from the noise, making training much faster.

Why are Word2Vec embeddings useful?
They place semantically similar words near each other in vector space, so cosine similarity captures relatedness and relationships appear as vector offsets. The vectors serve as features for classification, similarity search, and many downstream NLP tasks.

What are the limitations of Word2Vec?
It produces one static vector per word, so it cannot tell apart different senses of the same word by context, and it has no native handling for out-of-vocabulary words. Contextual models like BERT and subword methods like fastText address these gaps.

How does Word2Vec differ from BERT?
Word2Vec gives each word a single fixed vector regardless of sentence. BERT produces contextual embeddings: the same word gets different vectors depending on its surrounding words, capturing meaning that varies with context.