Fix Tokenization Drift by Matching SFT Token Patterns

Minor formatting changes such as extra spaces or newlines cause tokenization drift, shifting prompts out of distribution and dropping accuracy. Use Jaccard token overlap (above 80% is safe) to measure the risk; an Automated Prompt Optimization (APO) loop selects the best template, lifting simulated accuracy from as low as 38% to 88%.

Leading Spaces and Formatting Create Entirely New Token Sequences

Tokenization drift occurs when subtle changes, like adding a leading space, alter token IDs and sequence lengths, pushing inputs outside the model's trained distribution. Using the GPT-2 tokenizer (vocab size 50,257; the same byte-pair-encoding scheme that later models such as GPT-4, LLaMA, and Mistral build on), test pairs like " classify" vs. "classify": the space version encodes to the single token 36509, while the no-space version splits into tokens 4871 and 1958. All seven tested words (classify, answer, positive, negative, sentiment, output, label) produce different IDs, with deltas ranging from under 100 (low risk, e.g., label) to over 500 (high risk, e.g., classify at a delta of 31,638). Because the sequence lengths differ, the attention computation changes too, making "apple" and " apple" as distinct to the model as two unrelated words.
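The check itself is mechanical: encode both variants and compare IDs. A minimal sketch follows; the character-level stub encoder is a stand-in assumption so the snippet runs anywhere, and in practice you would pass the real GPT-2 BPE (e.g., `tiktoken.get_encoding("gpt2").encode` or a Hugging Face tokenizer) as `encode`.

```python
from typing import Callable, Dict, List


def compare_variants(word: str, encode: Callable[[str], List[int]]) -> Dict:
    """Tokenize a word with and without a leading space and report the drift."""
    with_space = encode(" " + word)
    without_space = encode(word)
    return {
        "with_space": with_space,
        "without_space": without_space,
        "same_ids": with_space == without_space,
        "length_delta": abs(len(with_space) - len(without_space)),
    }


# Character-level stub encoder (illustrative only); swap in a real BPE
# encoder such as tiktoken.get_encoding("gpt2").encode for actual IDs.
stub_encode = lambda text: [ord(c) for c in text]

report = compare_variants("classify", stub_encode)
print(report["same_ids"])      # False: the leading space changes the sequence
print(report["length_delta"])  # 1: the sequences even differ in length
```

The same comparison loop, run over all seven words with the real tokenizer, reproduces the per-word deltas described above.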

SFT models learn specific structural patterns (newlines, colons, field prefixes). Deviations lower Jaccard overlap with the canonical SFT template ("Below is a customer review. Classify the sentiment.\n\nReview: {review}\n\nSentiment:"): removing newlines drops it to 80%; omitting the leading space before "Review" to 85%; swapping the colon for a dash to 70%; rewording the instruction to 50%. Lower overlap signals higher OOD risk: above 80% is low risk, 60-80% medium, below 60% high, and the bands correlate with accuracy drops.
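The overlap metric is plain Jaccard similarity over the two prompts' token-ID sets. A short sketch, with hand-picked ID lists as illustrative inputs (real usage would feed in the tokenizer's output for the canonical and variant prompts):

```python
from typing import Sequence


def jaccard_overlap(ids_a: Sequence[int], ids_b: Sequence[int]) -> float:
    """Jaccard similarity between two token-ID sequences, treated as sets:
    |A ∩ B| / |A ∪ B|."""
    a, b = set(ids_a), set(ids_b)
    if not a and not b:
        return 1.0  # two empty prompts are trivially identical
    return len(a & b) / len(a | b)


canonical = [1, 2, 3, 4, 5]       # illustrative IDs for the SFT template
variant = [1, 2, 3, 4, 9]         # one token changed by a formatting tweak
print(round(jaccard_overlap(canonical, variant), 2))  # 0.67
```

Note that set-based Jaccard ignores token order and counts; it is a cheap screening signal for "how much of the trained template survived," not a full sequence alignment.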

Jaccard Overlap Quantifies OOD Risk from Prompt Variants

The canonical SFT prompt overlaps with itself at 100%. The variants score: no newlines 80% (medium risk), missing space 85% (low), dash instead of colon 70% (medium), reworded instruction ("Determine the sentiment... Answer:") 50% (high). On the sample review "The product exceeded all my expectations. Highly recommend!", these shifts push the model into unfamiliar token space, producing unpredictable outputs even though the task logic and data are unchanged.

The token-ID deltas confirm this: gaps above 500 for most words indicate severe drift. The thresholds give a practical safety rule: keep overlap above 80% to stay close to the training distribution and avoid degradation without retraining.

APO Loop Auto-Selects High-Overlap Prompts for Stable Performance

Implement Automated Prompt Optimization on 8-sample validation set (balanced positive/negative/neutral reviews). Test 5 candidates:

  • A (no formatting: "Classify: {review} Answer:");
  • B (minimal: "Review: {review}\nSentiment:");
  • C (SFT-aligned: full template with newlines/colons);
  • D (XML: "{review}\n");
  • E (full instruction: "You are a sentiment classifier... Output...").

Simulate accuracy as base 85% scaled by an overlap factor (0.5 + 0.5 × Jaccard), minus an OOD penalty (e.g., 0.18 for A, 0.02 for C), clipped to 40-95%, plus noise. Results: A 38%, B 50%, C 88%, D 63%, E 75%. APO picks C ("Variant C -- SFT-aligned") at 88% accuracy, 50 points above the worst variant, proving that the closest SFT match wins.
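The scoring rule and selection step can be sketched as follows. The noise term is omitted for determinism, and the overlap value for variant A is an illustrative assumption (the text only specifies the penalties for A and C), so the printed scores will not exactly reproduce the notebook's numbers:

```python
def simulated_accuracy(jaccard: float, penalty: float, base: float = 0.85) -> float:
    """APO scoring rule: scale base accuracy by an overlap factor,
    subtract an OOD penalty, and clip to [0.40, 0.95].
    (The original simulation also adds random noise; omitted here.)"""
    score = base * (0.5 + 0.5 * jaccard) - penalty
    return min(max(score, 0.40), 0.95)


# (jaccard, penalty) per candidate; A's jaccard of 0.45 is hypothetical.
candidates = {
    "A (no formatting)": (0.45, 0.18),
    "C (SFT-aligned)": (1.00, 0.02),
}
best = max(candidates, key=lambda name: simulated_accuracy(*candidates[name]))
print(best)  # C (SFT-aligned)
```

In a real APO loop, `simulated_accuracy` is replaced by an actual evaluation of each candidate prompt on the validation set; the argmax selection step stays the same.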

In production, replace simulation with real model evals on validation data. Full code: https://github.com/Marktechpost/AI-Agents-Projects-Tutorials/blob/main/NLP/Tokenization_Drift.ipynb. This keeps prompts in-distribution, stabilizing performance across pipeline changes.

Summarized by x-ai/grok-4.1-fast via openrouter


© 2026 Edge