Sentences Define Word Meanings via Self-Attention

Transformers ended three decades of sequential language processing with self-attention, in which every word weighs its relevance to every other word in the sentence; this single mechanism powers GPT and all modern LLMs.

Sequential Architectures Failed to Capture Full Context

Pre-Transformer models processed language word by word, causing inevitable information loss. RNNs from the late 1980s suffered vanishing gradients: the influence of early words faded by sentence end, like a goldfish memory in long sequences. LSTMs (1997) added forget, input, and output gates to selectively retain information, powering Google Translate and Gmail Smart Reply, but roughly tripled parameter and compute costs. GRUs (2014) merged gates to reach similar performance at about half the compute. Seq2Seq models compounded the problem by compressing an entire input into a single fixed-size vector for tasks like translation, creating a bottleneck in which long inputs lost their early details; short sentences worked, but nuance blurred in longer ones. All of these shared a core limit: strictly sequential processing ruled out parallel computation, capping scalability for documents beyond a few hundred words.
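To make both limits concrete, here is a minimal sketch of a vanilla recurrent encoder in Python with NumPy; the toy dimensions, random weights, and the rnn_encode helper are illustrative assumptions, not any production model. Note the two failure modes named above: each step must wait for the previous hidden state, and the whole input is squeezed through one fixed-size vector.

import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 16, 32                      # toy sizes, chosen for illustration
Wx = rng.normal(size=(d_hidden, d_in)) * 0.1
Wh = rng.normal(size=(d_hidden, d_hidden)) * 0.1

def rnn_encode(words):
    """Encode a sequence into ONE fixed-size vector, one step at a time."""
    h = np.zeros(d_hidden)                   # all accumulated context lives here
    for x in words:                          # strictly sequential: no parallelism
        h = np.tanh(Wx @ x + Wh @ h)         # new state overwrites old context
    return h                                 # early words survive only as faint traces

sentence = rng.normal(size=(50, d_in))       # 50 stand-in word embeddings
summary = rnn_encode(sentence)
print(summary.shape)                         # (32,) regardless of sentence length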

Self-Attention Enables Sentence-Level Meaning Resolution

The 2017 paper 'Attention Is All You Need', by eight Google researchers, introduced the Transformer, ditching RNNs, LSTMs, and GRUs in favor of parallel processing via self-attention. Every word simultaneously queries every other word: 'How relevant are you to me?' The answers dynamically adjust each word's representation based on the full context. In 'I bought an apple to eat,' 'apple' weights 'eat' and 'bought' heavily and resolves toward the fruit; in 'I bought Apple stock to sell,' the same token shifts toward the company. Ambiguous pronouns resolve the same way: in 'The trophy did not fit in the suitcase because it was too big,' the full sentence clarifies that 'it' is the trophy (had the clause ended 'too small,' 'it' would be the suitcase). By mimicking how humans read, taking in whole sentences at once, this eliminates fixed meanings for words like 'bank' (river or money) or 'apple' (fruit or company), deriving them instead from sentence-level signals. The original Transformer trained in 3.5 days on eight GPUs and beat the existing translation benchmarks.
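Here is a minimal sketch of the scaled dot-product self-attention behind that 'how relevant are you to me?' query, in Python with NumPy; the toy dimensions, random weights, and the self_attention helper are illustrative assumptions, not the paper's full multi-head setup. Row i of the returned weights matrix shows how strongly word i attends to every other word in the sentence.

import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one sentence.

    X: (seq_len, d_model) word embeddings. Each word's query is
    compared against every word's key, so every position sees the
    whole sentence at once.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # scores[i, j]: relevance of word j to word i
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights              # context-mixed representations

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8             # toy sizes, not the paper's
X = rng.normal(size=(seq_len, d_model))      # stand-in embeddings for 5 words
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, w = self_attention(X, Wq, Wk, Wv)
print(out.shape, w.shape)                    # (5, 8) (5, 5)

Because the score matrix comes from a single matrix multiplication, all positions are processed in parallel, which is exactly what the recurrent loop sketched earlier could not do.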

Transformers Scale to Power All Modern LLMs

OpenAI's GPT series built directly on this architecture, from GPT-1 (117M parameters) to GPT-4 (estimated at over a trillion), all using self-attention for billions of relevance computations per second. Every chatbot (ChatGPT, Claude), autocomplete, and LLM since runs this same core operation, replacing fading memories and fixed-vector bottlenecks. Words lack inherent meaning; sentences resolve them like variables, a truth machines grasped only after 30 years and one short paper.
