Word2Vec learns dense vector representations of words from raw text using a shallow neural network. Its two architectures, CBOW and skip-gram, predict words from context or context from words, placing similar words near each other in vector space.
What is Word2Vec?
Word2Vec is a method for learning dense, low-dimensional vector representations (embeddings) of words from a large unlabeled text corpus. Proposed by Tomas Mikolov and colleagues at Google in 2013, it uses a shallow two-layer neural network trained on a prediction task, and the learned weights become the word vectors.
The guiding idea is the distributional hypothesis: words that appear in similar contexts tend to have similar meanings. Word2Vec turns this into geometry. Words used in similar contexts end up with vectors that are close together, so that semantic and syntactic relationships show up as directions in the vector space.
- Learns embeddings from unlabeled text via a self-supervised prediction task.
- Rests on the distributional hypothesis: similar contexts imply similar meaning.
- The network's learned input weights serve as the final word vectors.
CBOW vs Skip-gram
Word2Vec comes in two architectures. Continuous Bag-of-Words (CBOW) predicts a target word from the averaged vectors of its surrounding context words. Skip-gram does the reverse: it takes a single center word and predicts each of its surrounding context words.
CBOW is faster to train and tends to work well on frequent words, since it aggregates context. Skip-gram is slower but usually produces better representations for rare words and smaller corpora, because each context word becomes a separate training signal for the center word.
- CBOW: context words in, center word out; faster, smooths over frequent words.
- Skip-gram: center word in, context words out; better for rare words and small data.
- Both use a sliding context window whose size is a tunable hyperparameter.
- The trained input weight matrix is the final word embedding table.
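The difference between the two architectures is easiest to see in the training examples each one extracts from the same window. The following is a minimal sketch, not the reference implementation: the function names `skipgram_pairs` and `cbow_examples` are made up for illustration, and it assumes a fixed window size, whereas real implementations often vary the effective window per position.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int) -> List[Tuple[str, str]]:
    """Return one (center, context) pair per context word in each window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens: List[str], window: int) -> List[Tuple[List[str], str]]:
    """Return one (context_words, target) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            examples.append((context, target))
    return examples


sentence = "the quick brown fox jumps".split()
sg = skipgram_pairs(sentence, window=1)   # many small examples
cb = cbow_examples(sentence, window=1)    # one aggregated example per word

print(sg[:3])  # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
print(cb[1])   # (['the', 'brown'], 'quick')
```

Skip-gram yields one example per context word, so a rare center word still receives several separate updates, while CBOW folds the whole window into a single example.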
Negative sampling and efficiency
Computing the full softmax over a large vocabulary at every step is expensive, since the denominator sums over every word. Word2Vec makes training practical with two tricks: hierarchical softmax, which replaces the flat softmax with a binary tree so that each prediction costs on the order of log V operations rather than V for a vocabulary of size V, and negative sampling, which is the more popular choice.
Negative sampling reframes the task as binary classification. For each real (center, context) pair, the model also draws a few random words as negatives and learns to tell the true context word apart from the sampled noise words. This turns one expensive softmax into a handful of cheap logistic updates. Word2Vec also subsamples very frequent words like "the" and "of", which improves both speed and the quality of representations for rarer, more informative words.
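Frequent-word subsampling can be sketched with the discard rule given in the 2013 paper, where a word with relative corpus frequency f is dropped with probability 1 - sqrt(t / f). The threshold value `t=1e-5` and the example frequencies below are illustrative assumptions, and released implementations use a slightly different variant of this formula.

```python
import math


def discard_prob(freq: float, t: float = 1e-5) -> float:
    """Probability of dropping one occurrence of a word.

    freq is the word's relative frequency in the corpus (count / total tokens).
    Words rarer than the threshold t are never dropped.
    """
    return max(0.0, 1.0 - math.sqrt(t / freq))


# Hypothetical relative frequencies, chosen only to show the shape of the rule.
p_the = discard_prob(0.05)     # a very frequent function word
p_rare = discard_prob(1e-6)    # a rare content word

print(round(p_the, 3))  # 0.986
print(p_rare)           # 0.0
```

Under this rule most occurrences of a very frequent word are skipped, while rare words are always kept.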
- Full softmax over the vocabulary is too costly for large corpora.
- Negative sampling trains against a few random noise words per positive pair.
- Frequent-word subsampling reduces the dominance of uninformative tokens.
- Typical embedding dimensions range from about 100 to 300.
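To make the "handful of cheap logistic updates" concrete, here is a minimal NumPy sketch of one skip-gram negative-sampling step. It is an illustration under stated assumptions rather than the reference implementation: the function name `sgns_step`, the learning rate, the random initialization, and the made-up word counts are all hypothetical. Negatives are drawn from the unigram distribution raised to the 0.75 power, as suggested in the original paper.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sgns_step(W_in, W_out, center, context, negatives, lr=0.05):
    """One SGD step on a (center, context) pair; returns the loss before the update."""
    v = W_in[center].copy()                       # center word vector
    targets = np.array([context] + list(negatives))
    labels = np.zeros(len(targets))
    labels[0] = 1.0                               # only the true context is positive
    U = W_out[targets]                            # output vectors, shape (k+1, dim)
    scores = sigmoid(U @ v)
    loss = -np.log(scores[0]) - np.sum(np.log(1.0 - scores[1:]))
    grad = scores - labels                        # gradient w.r.t. each dot product
    W_in[center] -= lr * (grad @ U)
    # subtract.at accumulates correctly if a negative index repeats
    np.subtract.at(W_out, targets, lr * np.outer(grad, v))
    return loss


rng = np.random.default_rng(0)
vocab_size, dim, k = 50, 16, 5
counts = rng.integers(1, 100, size=vocab_size)    # made-up word counts
noise = counts ** 0.75
noise = noise / noise.sum()                       # noise distribution for negatives

W_in = rng.normal(scale=0.1, size=(vocab_size, dim))
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))

center, context = 3, 7
draws = rng.choice(vocab_size, size=k, p=noise)
negatives = [int(n) for n in draws if n != context]  # never use the true context as noise

losses = [sgns_step(W_in, W_out, center, context, negatives) for _ in range(200)]
print(losses[0] > losses[-1])  # True
```

Each step touches only the center vector and at most k+1 output vectors, instead of every row of the vocabulary as a full softmax would.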
Vector arithmetic and analogies
A striking property of Word2Vec embeddings is that relationships between words often appear as roughly consistent vector offsets. The classic example is that the vector for "king" minus "man" plus "woman" lands near the vector for "queen". These analogies are commonly attributed to the training objective encoding co-occurrence regularities as approximately linear structure.
This makes the embeddings useful as features for downstream tasks such as text classification, named entity recognition, and semantic similarity, and as a conceptual foundation for later contextual embeddings from transformer models.
- Cosine similarity between vectors measures semantic relatedness.
- Analogy queries return the nearest word to a combination of vectors.
- Static embeddings: one fixed vector per word, regardless of sentence context.
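Cosine similarity and an analogy query can be sketched in a few lines. The four-dimensional vectors below are hand-built toy values, not trained embeddings, and the `analogy` helper is a hypothetical name. Excluding the query words from the candidates follows common practice.

```python
import numpy as np

# Toy vectors with hand-picked values, for illustration only.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.1, 0.8, 0.0]),
    "man":   np.array([0.1, 0.9, 0.1, 0.0]),
    "woman": np.array([0.1, 0.1, 0.9, 0.0]),
    "apple": np.array([0.0, 0.1, 0.1, 0.9]),
}


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def analogy(a, b, c, vectors):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the query words."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


result = analogy("king", "man", "woman", vectors)
print(result)  # queen (for these hand-built vectors)
```

With trained embeddings the same query is a nearest-neighbor search over the whole vocabulary, and the result is not guaranteed to be the expected word.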
Limitations and what came after
Word2Vec produces static embeddings: each word has exactly one vector regardless of how it is used, so it cannot distinguish the "bank" of a river from a savings "bank". It also has no built-in handling for words unseen during training (out-of-vocabulary), a gap that subword methods like fastText later addressed.
These limits motivated contextual embeddings, where a model produces a different vector for a word depending on its sentence. ELMo, built on recurrent networks, and then BERT and other transformer models generalized the embedding idea to be context-sensitive. Word2Vec remains important as a fast, interpretable baseline and as the conceptual origin of modern dense embeddings used in semantic search and vector databases.
- One static vector per word: cannot disambiguate polysemy by context.
- No native out-of-vocabulary handling; fastText added subword units.
- Contextual models (BERT and successors) extended embeddings to be context-aware.
- Still a strong, fast baseline and the ancestor of modern embedding systems.
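Both limitations follow from the fact that a static embedding model is, at inference time, just a lookup table. The sketch below uses a random table and a hypothetical `lookup` helper to show this. The vocabulary and sentences are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "river", "bank", "savings", "flooded", "closed"]
table = {w: rng.normal(size=4) for w in vocab}   # stand-in for trained vectors


def lookup(sentence):
    """Map each token to its static vector; None marks an out-of-vocabulary word."""
    return [(tok, table.get(tok)) for tok in sentence.split()]


s1 = dict(lookup("the river bank flooded"))
s2 = dict(lookup("the savings bank closed"))

# The surrounding words differ, but "bank" gets the identical vector both times.
same_vector = np.array_equal(s1["bank"], s2["bank"])

# A word that was not in the training vocabulary has no vector at all.
oov = dict(lookup("the fintech bank closed"))["fintech"]

print(same_vector)  # True
print(oov)          # None
```

A contextual model would instead compute the vector for "bank" from the whole sentence, and a subword model would assemble a vector for the unseen word from smaller units.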
Key takeaways
- Word2Vec learns dense word vectors from unlabeled text using a shallow neural network.
- CBOW predicts a word from its context; skip-gram predicts context from a word.
- Negative sampling and frequent-word subsampling make training over large vocabularies efficient.
- Semantic relationships appear as vector offsets, enabling analogy arithmetic in which "king" minus "man" plus "woman" lands near "queen".
- Embeddings are static (one vector per word), which contextual models like BERT later improved upon.