Real Data Masks Structural Bias in Three Ways
Historical datasets embed bias because they record past decisions, not true merit: urban applicants account for 71% of the historical data because of market expansion, not creditworthiness. Standard metrics such as 87% precision, 84% recall, and 0.8734 AUC look healthy because the validation set inherits the same skew. Rural samples are just 9% of the data (138, vs. 1,255 in the balanced set), so their errors are averaged away.
The three mechanisms are underrepresentation, proxy features, and label bias. Underrepresentation lets majority performance (urban AUC 0.884) conceal minority gaps (rural AUC 0.791). Proxy features such as postcode encode protected traits indirectly. Label bias bakes human prejudice into the targets, e.g., a +10% approval boost for urban applicants. Overall metrics hide all three; disaggregating by segment reveals a predicted rural approval rate of 0.341 against a true rate of 0.412.
Synthetic data breaks the cycle by enforcing population proportions (urban 40%, suburban 35%, rural 25%), providing statistical power for audits without real data constraints.
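As a minimal sketch of enforcing those proportions, the snippet below allocates segment labels so the counts match the target mix exactly. The names allocate_counts, sample_segments, and TARGET_MIX are hypothetical helpers for illustration, not part of the source's generate_loan_applicants.

```python
import random
from collections import Counter

def allocate_counts(n_total, proportions):
    """Split n_total rows across segments so the counts sum exactly to n_total."""
    raw = {seg: n_total * share for seg, share in proportions.items()}
    counts = {seg: int(value) for seg, value in raw.items()}
    # Give any leftover rows to the segments with the largest fractional remainder.
    leftover = n_total - sum(counts.values())
    by_remainder = sorted(raw, key=lambda seg: raw[seg] - counts[seg], reverse=True)
    for seg in by_remainder[:leftover]:
        counts[seg] += 1
    return counts

def sample_segments(n_total, proportions, seed=0):
    """Return a shuffled list of segment labels with the enforced mix."""
    counts = allocate_counts(n_total, proportions)
    labels = [seg for seg, k in counts.items() for _ in range(k)]
    random.Random(seed).shuffle(labels)
    return labels

# Target mix quoted in the text.
TARGET_MIX = {"urban": 0.40, "suburban": 0.35, "rural": 0.25}

labels = sample_segments(1000, TARGET_MIX)
print(Counter(labels))
```

Under this mix, a total of 5,020 rows would give 1,255 rural rows, the rural count quoted above. The source does not state the total, so treat that as illustrative arithmetic only.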
Framework: Control Segments to Uncover Bias via Disaggregated Metrics
Generate two datasets with generate_loan_applicants: historical (urban 71.2%) and balanced. Train GradientBoostingClassifier on historical data (n_estimators=100, max_depth=4), yielding solid overall AUC 0.8734.
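The training step might look roughly like the sketch below. It assumes scikit-learn and NumPy are installed; make_toy_applicants is a hypothetical stand-in for generate_loan_applicants, whose real signature and features are not given here, so the AUC it prints will not match the figures in the text.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def make_toy_applicants(n, seed=0):
    """Hypothetical toy generator; NOT the source's generate_loan_applicants."""
    rng = np.random.default_rng(seed)
    income = rng.normal(50.0, 15.0, n)       # toy scale, in thousands
    debt_ratio = rng.uniform(0.0, 1.0, n)
    # Toy label rule: higher income and lower debt raise the approval odds.
    logit = 0.08 * (income - 50.0) - 3.0 * (debt_ratio - 0.5)
    prob = 1.0 / (1.0 + np.exp(-logit))
    approved = (rng.uniform(0.0, 1.0, n) < prob).astype(int)
    return np.column_stack([income, debt_ratio]), approved

X, y = make_toy_applicants(2000)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Hyperparameters quoted in the text; random_state is added only for repeatability.
model = GradientBoostingClassifier(n_estimators=100, max_depth=4, random_state=0)
model.fit(X_train, y_train)

# Score the held-out rows and compute the overall (non-disaggregated) AUC.
val_scores = model.predict_proba(X_val)[:, 1]
overall_auc = roc_auc_score(y_val, val_scores)
print(f"overall validation AUC: {overall_auc:.4f}")
```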
Evaluate by segment:
| Segment | Historical AUC | Historical predicted approval (true) | Balanced AUC | Balanced predicted approval (true) |
|---|---|---|---|---|
| Rural | 0.791 | 0.341 (0.412) | 0.768 | 0.334 (0.418) |
| Suburban | 0.869 | 0.468 (0.471) | 0.852 | 0.464 (0.469) |
| Urban | 0.884 | 0.521 (0.523) | 0.889 | 0.524 (0.521) |
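Rural performance degrades further once the rural sample is scaled up in the balanced set (AUC 0.791 to 0.768), and predicted rural approval (0.334) sits well below the true rate (0.418): the model under-approves qualified rural applicants.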
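A per-segment evaluation of this kind can be sketched in plain Python. The function segment_report is a hypothetical illustration, not the source's evaluate_by_segment. It assumes the predicted approval rate is the share of scores at or above a 0.5 threshold, which the source does not specify, and it runs on made-up toy data. The pairwise AUC is quadratic in segment size, so a library routine such as scikit-learn's roc_auc_score is the better choice for real data.

```python
from collections import defaultdict

def auc_score(y_true, scores):
    """AUC as the chance a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        return float("nan")  # AUC is undefined without both classes
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def segment_report(segments, y_true, scores, threshold=0.5):
    """Disaggregate AUC and approval rates by segment label."""
    grouped = defaultdict(list)
    for seg, y, s in zip(segments, y_true, scores):
        grouped[seg].append((y, s))
    report = {}
    for seg, rows in grouped.items():
        ys = [y for y, _ in rows]
        ss = [s for _, s in rows]
        report[seg] = {
            "n": len(rows),
            "auc": auc_score(ys, ss),
            # Assumption: "predicted approval" means score >= threshold.
            "pred_approval": sum(s >= threshold for s in ss) / len(rows),
            "true_approval": sum(ys) / len(rows),
        }
    return report

# Toy data for illustration only.
segments = ["urban"] * 4 + ["rural"] * 4
y_true = [1, 0, 1, 0, 1, 1, 1, 0]
scores = [0.9, 0.2, 0.7, 0.4, 0.45, 0.6, 0.3, 0.35]

report = segment_report(segments, y_true, scores)
for seg, metrics in report.items():
    print(seg, metrics)
```

In the toy data the rural predicted approval rate falls well below the true rate while the urban rates agree, the same pattern the table shows at a smaller magnitude.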
The fairness audit computes disparate impact (DI), each segment's predicted approval rate divided by the urban reference rate, and flags values below 0.8 per the EEOC 80% (four-fifths) rule:
- Historical: Rural DI 0.654 (fail)
- Balanced: Rural DI 0.641 (fail), Suburban DI 0.891 (pass)
evaluate_by_segment and compute_fairness_metrics quantify these gaps; Equalized Odds additionally checks that true positive and false positive rates match across segments.
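The two fairness checks can be sketched as follows. The helpers segment_rates and fairness_vs_reference are hypothetical, not the source's compute_fairness_metrics, and the predictions are toy values. DI is taken as the ratio of selection rates against the reference segment, and the Equalized Odds gaps are the differences in true positive and false positive rates.

```python
def segment_rates(y_true, y_pred):
    """Selection rate, TPR, and FPR for one segment's binary predictions."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return {
        "selection_rate": sum(y_pred) / len(y_pred),
        "tpr": tp / pos if pos else float("nan"),
        "fpr": fp / neg if neg else float("nan"),
    }

def fairness_vs_reference(by_segment, reference="urban", di_floor=0.8):
    """Compare every segment with the reference on DI and Equalized Odds gaps."""
    ref = segment_rates(*by_segment[reference])
    if ref["selection_rate"] == 0:
        raise ValueError("reference segment has no predicted approvals")
    out = {}
    for seg, (y_true, y_pred) in by_segment.items():
        r = segment_rates(y_true, y_pred)
        di = r["selection_rate"] / ref["selection_rate"]
        out[seg] = {
            "di": di,
            "passes_di": di >= di_floor,      # 80% rule
            "tpr_gap": r["tpr"] - ref["tpr"],  # Equalized Odds wants both gaps near 0
            "fpr_gap": r["fpr"] - ref["fpr"],
        }
    return out

# Toy labels and predictions for illustration only.
by_segment = {
    "urban": ([1, 1, 0, 0, 1, 0, 1, 0, 1, 0], [1, 1, 0, 0, 1, 1, 1, 0, 0, 0]),
    "rural": ([1, 1, 0, 0, 1, 0, 1, 0, 1, 0], [1, 0, 0, 0, 1, 0, 0, 0, 1, 0]),
}

audit = fairness_vs_reference(by_segment)
for seg, result in audit.items():
    print(seg, result)
```

Keeping DI and the Equalized Odds gaps side by side matters because a segment can pass one and fail the other: DI looks only at how often the model approves, while the gaps look at whether it is right at the same rate.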
Retrain on Augmented Data to Achieve Fairness Without Sacrificing Accuracy
Combine the historical and balanced data and retrain: overall AUC drops only slightly, from 0.8734 to 0.8701, while rural DI rises to 0.812 (pass) and every segment reaches DI ≥0.80.
Checklist for production:
- Segment-level AUC per group
- Disaggregated prediction rates
- DI ≥0.80
- Equalized Odds
- Retrain if any check fails
- Revalidate after retraining
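One way to turn the checklist into an automated gate is sketched below. The function audit_gate and its max_odds_gap parameter are hypothetical; only the 0.80 DI floor comes from the text, and the DI inputs are the figures reported above.

```python
def audit_gate(segment_metrics, di_floor=0.80, max_odds_gap=None):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    for seg, m in segment_metrics.items():
        if m["di"] < di_floor:
            failures.append(f"{seg}: DI {m['di']:.3f} below {di_floor:.2f}")
        # Optional Equalized Odds check; the tolerance is a hypothetical choice.
        if max_odds_gap is not None:
            gap = max(abs(m["tpr_gap"]), abs(m["fpr_gap"]))
            if gap > max_odds_gap:
                failures.append(f"{seg}: odds gap {gap:.3f} above {max_odds_gap:.2f}")
    return failures

# DI values reported in the text for the balanced evaluation set.
before = {"rural": {"di": 0.641}, "suburban": {"di": 0.891}}
# Rural DI reported after retraining on the augmented data.
after = {"rural": {"di": 0.812}}

before_failures = audit_gate(before)
after_failures = audit_gate(after)
print("before retraining:", before_failures)
print("after retraining:", after_failures)
```

A non-empty list would trigger the retrain step, and the same gate run on the retrained model serves as the revalidation step.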
Controlling segment sizes synthetically gives audits adequate statistical power (e.g., 1,255 rural samples instead of 138); real data alone leaves small-group estimates noisy. Test on the balanced synthetic set first to catch bias before production.