The Problem with Standard Tokenization in Technical Domains
Standard tokenizers often fail on technical documents because they treat physical quantities (e.g., "10.5 kg", "220 V") as generic sequences of characters or digits. The numerical value and the unit of measurement end up in separate tokens, so the model loses the semantic link between the magnitude and its physical dimension. Brazilian Portuguese complicates this further: its formatting conventions, such as the decimal comma ("10,5 kg"), and its technical nomenclature are not what general-purpose tokenizers are optimized to handle.
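The fragmentation described above can be reproduced with a deliberately naive tokenizer. The sketch below is illustrative only: `naive_tokenize` and its splitting rule are assumptions made for this example and do not correspond to any specific production tokenizer, but subword tokenizers show the same kind of split between value and unit.

```python
import re
from typing import List

# Letters, digit runs, or single punctuation marks: a deliberately naive
# scheme, used here only to illustrate the fragmentation problem.
_NAIVE = re.compile(r"[^\W\d_]+|\d+|[^\w\s]")

def naive_tokenize(text: str) -> List[str]:
    """Split text into letter runs, digit runs, and punctuation marks."""
    return _NAIVE.findall(text)

# The decimal comma splits the number, and the unit becomes a separate token.
print(naive_tokenize("10,5 kg"))   # ['10', ',', '5', 'kg']
# A compound unit is broken into three unrelated pieces.
print(naive_tokenize("100 m/s"))   # ['100', 'm', '/', 's']
```

Nothing in the resulting token list records that "10", ",", and "5" form one number, or that "kg" qualifies that number.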
Toten: Knowledge-Based Ontological Tokenization
Toten takes an ontological approach to tokenization, moving beyond simple whitespace or subword splitting. By integrating a knowledge base of physical quantities and technical notation, the framework identifies scientific expressions and keeps them intact. Instead of breaking "100 m/s" into arbitrary sub-tokens, Toten recognizes the expression as a single semantic unit. Downstream consumers, such as Large Language Models (LLMs) or information extraction pipelines, therefore receive structured input in which each magnitude stays attached to its unit and physical dimension.
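A minimal sketch of the general idea, not Toten's actual implementation, might look like the following. Everything here is an assumption made for illustration: the `UNIT_DIMENSIONS` table stands in for a real ontology, and `Token` and `tokenize` are hypothetical names. A quantity is matched as a number followed by a known unit symbol and emitted as one token annotated with its value and dimension. All other text falls back to a simple splitter.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical miniature "ontology": unit symbol -> physical dimension.
UNIT_DIMENSIONS = {
    "kg": "mass",
    "m/s": "velocity",
    "m": "length",
    "V": "electric potential",
}

@dataclass
class Token:
    text: str
    dimension: Optional[str] = None  # set only for quantity tokens
    value: Optional[float] = None    # set only for quantity tokens

# Longest symbols first, so that "m/s" is tried before "m".
_units = "|".join(
    re.escape(u) for u in sorted(UNIT_DIMENSIONS, key=len, reverse=True)
)
# A number with a decimal comma or point, an optional space, and a known
# unit that is not immediately followed by another letter ("10 metros"
# must not match "m"). Thousands separators are not handled in this sketch.
_QUANTITY = re.compile(
    rf"(?<![\d.,])(\d+(?:[.,]\d+)?)\s?({_units})(?![^\W\d_])"
)
_FALLBACK = re.compile(r"[^\W\d_]+|\d+|[^\w\s]")

def tokenize(text: str) -> List[Token]:
    """Emit each recognized quantity as one token; split the rest naively."""
    tokens: List[Token] = []
    pos = 0
    for m in _QUANTITY.finditer(text):
        # Plain text before the quantity goes through the fallback splitter.
        tokens += [Token(t) for t in _FALLBACK.findall(text[pos:m.start()])]
        value = float(m.group(1).replace(",", "."))
        tokens.append(Token(m.group(0), UNIT_DIMENSIONS[m.group(2)], value))
        pos = m.end()
    tokens += [Token(t) for t in _FALLBACK.findall(text[pos:])]
    return tokens

for tok in tokenize("A velocidade foi de 100 m/s e a massa de 10,5 kg."):
    print(tok)
```

In this sketch "100 m/s" and "10,5 kg" each come out as a single token carrying a numeric value and a dimension label, while the surrounding words are split as usual. A real knowledge base would cover far more units, prefixes, and notational variants than this four-entry table.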
Impact on Technical NLP
By anchoring tokenization in an ontology, Toten improves the accuracy of technical text processing in Brazilian Portuguese and reduces ambiguity in scientific literature, engineering documentation, and technical reports. The framework demonstrates that, for domain-specific tasks, building domain knowledge into the tokenizer is more effective than relying on massive, general-purpose training data alone. This is particularly valuable for applications that require high precision in unit conversion, scientific reasoning, and technical data extraction.