DistilBERT Predicts Root Causes from Customer Contacts
Fine-tune DistilBERT on 21,500 synthetic service records to generate top-5 root cause hypotheses from contact drivers, surfacing rare issues via low-confidence signals while avoiding over-reliance on top-1 predictions.
Prototype Design Accelerates Root Cause Investigations
Customer service often detects operational symptoms, such as payment failures or delivery delays, before the underlying root causes in product, engineering, or logistics are identified. This DistilBERT sequence classification model uses contact driver text plus categorical context (business type, product category, specialization) to predict among 912 possible root causes. Built as a one-month PoC with a Streamlit UI, it outputs top-5 hypotheses with probabilities visualized as Plotly bar charts, enabling analysts to prioritize investigations without mature data infrastructure.
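A minimal sketch of the input construction, with hypothetical field names (the exact schema is not shown in the write-up): categorical context and contact driver text are concatenated into a single string before tokenization.

```python
from transformers import AutoTokenizer

# Hypothetical record; field names are illustrative, not the repo's schema.
record = {
    "contact_driver": "Payment fails at checkout with error 502",
    "business_type": "e-commerce",
    "product_category": "electronics",
    "specialization": "payments",
}

def build_input(rec: dict) -> str:
    # Concatenate categorical context with the free-text contact driver so
    # the transformer sees symptom and business context in one sequence.
    return (f"{rec['business_type']} | {rec['product_category']} | "
            f"{rec['specialization']} | {rec['contact_driver']}")

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoded = tokenizer(build_input(record), truncation=True, max_length=128,
                    return_tensors="pt")
```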
A synthetic dataset of 21,500 interactions mimics real patterns across e-commerce, SaaS, and banking: 307 contact driver categories map to 424 root cause categories. The input combines the text fields into one representation; a LabelEncoder turns the multi-class output into integer labels. After a train/validation split, distilbert-base-uncased is fine-tuned for 3 epochs, dropping the loss from 12,594 to 3,482 and lifting validation accuracy to 0.9665 on clean data; promising, but limited by the synthetic nature of the data.
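A condensed sketch of that training setup; the file name, column names, and batch size are assumptions, only the base model and the 3 epochs come from the write-up.

```python
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

df = pd.read_csv("synthetic_contacts.csv")        # hypothetical file name
le = LabelEncoder()
df["label"] = le.fit_transform(df["root_cause"])  # integer ids for 912 classes
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")

class ContactDataset(torch.utils.data.Dataset):
    def __init__(self, frame):
        self.enc = tok(list(frame["text"]), truncation=True,
                       padding=True, max_length=128)
        self.labels = list(frame["label"])
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(le.classes_))
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=ContactDataset(train_df),
    eval_dataset=ContactDataset(val_df),
)
trainer.train()
```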
The model totals 67.6M parameters: the DistilBERT backbone (98%) provides language understanding, while a 1.29M-parameter classification head adapts it to the task. The L2 norms of the per-class weight vectors form a bell curve (mean 0.75, range 0.643-0.865), with frequent issues like vulnerability patches carrying stronger weights than rare ones like data breaches, reflecting what the training data emphasized.
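The norm analysis takes a few lines; a sketch assuming the fine-tuned checkpoint is saved at a hypothetical local path:

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("out")  # hypothetical path

# DistilBertForSequenceClassification exposes the final linear layer as
# `classifier`; its weight matrix has one row per class (912 x 768).
W = model.classifier.weight.detach()
norms = W.norm(dim=1)              # L2 norm of each class's weight vector

print(f"mean {norms.mean():.3f}, min {norms.min():.3f}, max {norms.max():.3f}")
```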
Classification Head Reveals Distinguishability and Confusion Risks
Cosine similarity between class weight vectors averages 0.184 across the 912 classes, indicating good overall separation, but semantically related pairs exceed 0.5: e.g., Credit Limit Errors vs. Fraudulent Transaction Flags (0.53), Charging Speed Problem vs. Charging Station Compatibility (0.51). Target these pairs for extra context or human review, since similar symptoms produce plausible confusions.
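A sketch of the similarity scan that surfaces such pairs, again loading the checkpoint from a hypothetical path:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("out")  # hypothetical path
W = model.classifier.weight.detach()      # (num_classes, hidden) = (912, 768)
Wn = F.normalize(W, dim=1)                # unit-normalize each class vector
sim = Wn @ Wn.T                           # pairwise cosine similarity matrix

off_diag = sim[~torch.eye(sim.size(0), dtype=torch.bool)]
print(f"mean off-diagonal similarity: {off_diag.mean():.3f}")

# Flag class pairs above 0.5 as confusion risks needing extra context.
idx = torch.triu(sim > 0.5, diagonal=1).nonzero()
risky_pairs = [(int(i), int(j), float(sim[i, j])) for i, j in idx]
```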
Bias terms stay near zero (-0.008 to +0.002), avoiding skewed priors. The test case "airbag not functioning" ranks Airbag Deployment Sensor Fault in the top-5 at just 0.01 probability: weak mathematically, but vital logically for safety-critical signals.
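A top-5 inference sketch matching that behavior; the checkpoint path is hypothetical and the root cause names are assumed to be stored in the model config:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("out")                    # hypothetical path
model = AutoModelForSequenceClassification.from_pretrained("out").eval()

inputs = tok("airbag not functioning", return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1).squeeze(0)

top = torch.topk(probs, k=5)  # surface five hypotheses, not just the argmax
for p, i in zip(top.values, top.indices):
    # id2label is assumed to map the 912 class ids to root cause names.
    print(f"{model.config.id2label[int(i)]}: {p:.4f}")
```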
Confidence Paradox Demands Top-5 Over Top-1 Focus
High-confidence predictions for frequent patterns mask rare, counterintuitive causes, and correlation is not causation (a website error may actually originate with the payment provider). Two traps recur: common patterns hide outliers, and identical symptoms span distinct failures, such as delivery delays caused by logistics, inventory, or suppliers.
The workflow that works: the model proposes hypotheses → humans add domain logic → evidence validates. Top-5 recall catches valuable low-confidence candidates, so evaluate with top-k metrics (see the sketch below), not accuracy alone. The repo includes the Streamlit app and an EDA/training notebook but omits the dataset and trained models; use it as a reference, not for reproduction.
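A minimal top-k metric sketch for that evaluation:

```python
import numpy as np

def top_k_accuracy(logits: np.ndarray, labels: np.ndarray, k: int = 5) -> float:
    """Fraction of examples whose true class appears among the k highest
    scores; with 912 classes this is far more informative than top-1."""
    topk = np.argsort(-logits, axis=1)[:, :k]   # indices of the k best classes
    hits = (topk == labels[:, None]).any(axis=1)
    return float(hits.mean())
```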
Path to Production: Evidence Over Pure Prediction
Replace the synthetic data with anonymized real logs; add probability calibration, explainability (e.g., evidence for and against each hypothesis), feedback loops from confirmed root causes, and RAG over incident documentation. Safeguard rare and safety-critical classes. This shifts AI from decider to accelerator: structuring daily symptoms into actionable starting points while blending probabilities with causality checks.
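As one concrete calibration option, a minimal temperature-scaling sketch (a standard post-hoc technique, not something the repo confirms): a single scalar T is fitted on held-out logits so reported probabilities better track observed frequencies.

```python
import torch

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Fit one temperature T on validation logits by minimizing NLL;
    at inference, report softmax(logits / T) instead of raw softmax."""
    log_t = torch.zeros(1, requires_grad=True)   # optimize log T so T > 0
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)
    nll = torch.nn.CrossEntropyLoss()

    def closure():
        opt.zero_grad()
        loss = nll(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    opt.step(closure)
    return float(log_t.exp())
```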