Word2Vec: Turning Word Neighborhoods into Embeddings
Word2Vec learns dense word vectors by predicting local contexts with CBOW or Skip-gram, clustering similar words like 'cat' and 'dog' via repeated gradient updates from shared neighborhoods.
Shift from Isolated IDs to Relational Embeddings
Before Word2Vec, words were treated as unique IDs or one-hot vectors (e.g., cat → [1, 0, 0, 0, 0]), preserving identity but ignoring relationships such as 'cat' being closer in meaning to 'dog' than to 'engine'. Word2Vec flips this by learning dense vectors where meaning emerges from context: a word's vector is shaped by the local neighborhoods it repeatedly appears in. In a tiny corpus ('the cat drinks milk', 'the dog drinks water', 'the cat chases the mouse', 'the dog chases the ball'), 'cat' appears near 'the', 'drinks', 'milk', 'chases', 'mouse', while 'dog' shares 'the', 'drinks', 'chases' but differs on 'water' and 'ball'. Similar contexts deliver matching gradient signals during training, pulling vectors such as cat → (0.82, 0.21, -0.05) and dog → (0.79, 0.25, -0.03) into nearby regions and enabling geometric analogies like king - man + woman ≈ queen.
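To make the contrast concrete, here is a minimal numpy sketch (the dense values for cat and dog are the toy numbers above; the one-hot vectors and the engine vector are made up for illustration): one-hot vectors are all mutually orthogonal, so no word is "closer" to any other, while dense vectors let cosine similarity reflect shared context.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, ~0 or negative for unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot IDs: every word is orthogonal to every other word.
cat_onehot = np.array([1, 0, 0, 0, 0])
dog_onehot = np.array([0, 1, 0, 0, 0])
engine_onehot = np.array([0, 0, 1, 0, 0])
print(cosine(cat_onehot, dog_onehot))      # 0.0
print(cosine(cat_onehot, engine_onehot))   # 0.0 -- no notion of "closer"

# Dense vectors (toy values): shared contexts place cat near dog.
cat = np.array([0.82, 0.21, -0.05])
dog = np.array([0.79, 0.25, -0.03])
engine = np.array([-0.40, 0.10, 0.90])     # illustrative, not learned
print(cosine(cat, dog))      # ~0.998 -- close
print(cosine(cat, engine))   # negative -- far
```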
This relational view, treating words as positions in a space whose geometry preserves structure, outperforms sparse representations because similar training pressures from shared neighborhoods create clustered embeddings without any explicit semantic rules.
CBOW vs Skip-gram: Dual Paths to Context Prediction
Word2Vec optimizes dense vectors (e.g., dimension 3 for a vocabulary of 9) via a simple network: one-hot input (size 9) → hidden layer (size 3) → output scores (size 9). The input-to-hidden weight matrix is the embedding table, where each word's row (e.g., an initial cat → (0.11, -0.08, 0.05)) gets refined during training.
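A small numpy sketch of that shape, with an assumed toy vocabulary of 9 words and random initial weights standing in for whatever initialization is actually used:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "dog", "drinks", "milk", "water", "chases", "mouse", "ball"]
V, D = len(vocab), 3                       # vocab size 9, embedding size 3

W_in = rng.normal(0, 0.1, size=(V, D))     # input->hidden: the embedding table
W_out = rng.normal(0, 0.1, size=(D, V))    # hidden->output: one score column per word

word_id = vocab.index("cat")
one_hot = np.zeros(V)
one_hot[word_id] = 1.0

hidden = one_hot @ W_in                    # equivalent to looking up row `word_id` of W_in
scores = hidden @ W_out                    # one raw score per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary

print(W_in[word_id])    # cat's current embedding (this row is what training refines)
print(probs.round(3))   # predicted distribution over possible context words
```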
CBOW predicts center from context (input: 'the', 'drinks' → target: 'cat'), treating surroundings as clues that constrain word identity, like recovering a word from its situational fit. Skip-gram reverses it (input: 'cat' → targets: 'the', 'drinks'), capturing a word's relational footprint—what neighbors it generates. With window size 1, Skip-gram generates pairs like cat → the, cat → drinks; CBOW inverts them.
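A sketch of the pair generation with window size 1 over the toy corpus (sentence list assumed from the earlier examples):

```python
corpus = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
]
window = 1

skipgram_pairs = []   # (center -> context): Skip-gram training pairs
cbow_examples = []    # (context words -> center): CBOW training examples
for sentence in corpus:
    for i, center in enumerate(sentence):
        context = [sentence[j]
                   for j in range(max(0, i - window), min(len(sentence), i + window + 1))
                   if j != i]
        skipgram_pairs += [(center, c) for c in context]
        cbow_examples.append((context, center))

print(skipgram_pairs[:4])   # [('the','cat'), ('cat','the'), ('cat','drinks'), ('drinks','cat')]
print(cbow_examples[2])     # (['cat', 'milk'], 'drinks')
```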
Both unify around mutual definition: context shapes word (CBOW), word shapes context (Skip-gram). Skip-gram excels for rare words by amplifying their signal; CBOW smooths frequent ones. Together, they force embeddings to encode predictive utility, yielding a map where milk → (0.10, 0.88, -0.12) clusters near water → (0.07, 0.84, -0.10).
Training Mechanics: Gradients Sculpt the Space
Training slides a window over the text, generating examples (e.g., center 'cat' with contexts 'the', 'drinks'). For Skip-gram on cat → the: retrieve cat's vector, compute output scores over the vocabulary (e.g., a raw score of 0.12 for 'the' → softmax probability 0.20), measure the error against the target, and backpropagate to nudge weights, pulling cat's vector toward 'the' and pushing it away from non-context words like 'engine'.
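A minimal sketch of one such Skip-gram update with a full softmax, assuming a toy vocabulary and a learning rate of 0.1 (all numeric values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["the", "cat", "dog", "drinks", "milk", "water", "engine"]
V, D, lr = len(vocab), 3, 0.1

W_in = rng.normal(0, 0.1, (V, D))    # input (embedding) vectors
W_out = rng.normal(0, 0.1, (V, D))   # output (context) vectors

center, target = vocab.index("cat"), vocab.index("the")

v = W_in[center].copy()                          # 1. retrieve cat's vector
scores = W_out @ v                               # 2. raw score for every vocab word
probs = np.exp(scores) / np.exp(scores).sum()    # 3. softmax
loss = -np.log(probs[target])                    # 4. cross-entropy against the true context 'the'

# 5. backpropagate: error = predicted probabilities minus one-hot target
err = probs.copy()
err[target] -= 1.0
W_in[center] -= lr * (W_out.T @ err)   # nudge cat toward 'the', away from the rest
W_out -= lr * np.outer(err, v)         # nudge output vectors symmetrically

print(round(loss, 3))
```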
Negative sampling scales this: for cat → drinks, attract the true pair and repel 3-5 randomly sampled fakes (e.g., 'banana', 'cloud'), so the geometry forms through attraction (shared pet/action contexts) and repulsion (unrelated words). Repeated across the corpus, similar contexts yield parallel updates: cat and dog, both near 'the', 'drinks', 'chases', converge without any semantic labels.
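A sketch of a single negative-sampling update, assuming two negatives and the same toy setup (the sigmoid objective below is the standard skip-gram-with-negative-sampling form, written out for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = ["the", "cat", "dog", "drinks", "milk", "water", "banana", "cloud"]
V, D, lr = len(vocab), 3, 0.1
W_in = rng.normal(0, 0.1, (V, D))    # input (embedding) vectors
W_out = rng.normal(0, 0.1, (V, D))   # output (context) vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_step(center, true_context, negatives):
    """One update for a (center -> context) pair: attract the true pair, repel the fakes."""
    v = W_in[center].copy()
    grad_v = np.zeros(D)
    # label 1 for the real context word, 0 for each sampled negative
    for ctx, label in [(true_context, 1.0)] + [(n, 0.0) for n in negatives]:
        u = W_out[ctx]
        err = sigmoid(u @ v) - label     # prediction error for this pair
        grad_v += err * u
        W_out[ctx] -= lr * err * v       # move the output vector toward/away from cat
    W_in[center] -= lr * grad_v          # move cat's embedding

# cat -> drinks, repelled from a couple of random unrelated words
negatives = [vocab.index("banana"), vocab.index("cloud")]
negative_sampling_step(vocab.index("cat"), vocab.index("drinks"), negatives)
print(W_in[vocab.index("cat")])
```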
Outcome: randomly initialized vectors become a relational map. Training builds it through an enormous number of tiny corrections; the full process turns prediction errors into stable positions.
Inference and Limitations in Modern Context
After training, discard the predictor and keep the embedding matrix: use it for lookups (cat's vector), similarity (cosine similarity ranks cat/dog above cat/engine), sentence vectors by averaging ('the cat drinks milk' → mean of its word vectors), or downstream tasks like classification.
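In practice this whole pipeline is usually a library call; a sketch with gensim, assuming gensim 4.x (where the relevant parameters are vector_size, window, sg, and negative):

```python
import numpy as np
from gensim.models import Word2Vec

sentences = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the cat chases the mouse".split(),
    "the dog chases the ball".split(),
]

# sg=1 selects Skip-gram (sg=0 is CBOW); negative=5 enables negative sampling.
model = Word2Vec(sentences, vector_size=3, window=1, min_count=1,
                 sg=1, negative=5, epochs=200, seed=0)

print(model.wv["cat"])                     # lookup: cat's learned vector
print(model.wv.similarity("cat", "dog"))   # cosine similarity between two words
print(model.wv.similarity("cat", "water"))

# Sentence embedding by averaging word vectors.
sentence = "the cat drinks milk".split()
sentence_vec = np.mean([model.wv[w] for w in sentence], axis=0)
print(sentence_vec)
```

With a corpus this small the learned similarities are noisy; the point is the API shape: train, then use model.wv for lookups, similarity, and averaging.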
Word2Vec revolutionized NLP by proving that prediction alone yields emergent semantics, replacing hand-engineered features with learned geometry. Yet static vectors fail on polysemy ('bank' as river edge vs. financial institution gets a single embedding), which spurred contextual models like BERT. Its legacy persists: modern LLMs inherit context-driven, relational meaning, with embeddings as vectors first and structure second.