Synthetic Data Exposes Hidden ML Bias Before Production

Real Data Masks Structural Bias in Three Ways

Historical datasets embed bias because they reflect past decisions, not true merit: urban applicants account for about 71% of the historical records because of market expansion, not creditworthiness. Standard metrics (87% precision, 84% recall, 0.8734 AUC) look healthy because the validation set inherits the same skew. Rural samples make up just 9% of it (138 rows, versus 1,255 in the balanced data), so rural errors are averaged away.

The masking takes three forms. Underrepresentation lets majority performance (urban AUC 0.884) conceal minority gaps (rural AUC 0.791). Proxy features such as postcode encode protected traits indirectly. Label bias bakes human prejudice into the targets, e.g., a +10% approval boost for urban applicants. Aggregate metrics hide all three; disaggregation reveals a predicted rural approval rate of 0.341 against a true rate of 0.412.

Synthetic data breaks the cycle by enforcing chosen population proportions (urban 40%, suburban 35%, rural 25%). Every segment then has enough samples for a statistically meaningful audit, regardless of how the real data happened to be collected.
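A minimal sketch of the proportion-enforcing step, using only the standard library. The function name sample_segments and the sample size of 5,000 are assumptions made for illustration; only the 40/35/25 split comes from the text.

```python
import random
from collections import Counter

# Target population shares taken from the article.
SEGMENT_SHARES = {"urban": 0.40, "suburban": 0.35, "rural": 0.25}

def sample_segments(n, shares, seed=0):
    """Draw n segment labels so that each segment appears with its target share."""
    rng = random.Random(seed)  # seeded for reproducible audits
    names = list(shares)
    weights = [shares[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

# 5,000 is a hypothetical sample size, not a figure from the article.
labels = sample_segments(5000, SEGMENT_SHARES)
counts = Counter(labels)
for name in SEGMENT_SHARES:
    print(name, counts[name], round(counts[name] / len(labels), 3))
```

Because the labels are sampled rather than allocated exactly, realized counts fluctuate around the targets, which may be why a nominal 25% rural share shows up as a figure like 1,255 rather than a round number.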

Framework: Control Segments to Uncover Bias via Disaggregated Metrics

Generate two datasets with generate_loan_applicants: a historical one (urban 71.2%) and a balanced one. Train a GradientBoostingClassifier (n_estimators=100, max_depth=4) on the historical data; it reaches a solid overall AUC of 0.8734.
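The training step might look roughly like the sketch below. The article does not show generate_loan_applicants, so make_applicants is a hypothetical stand-in: its features, label rule, and sample size are invented, and the suburban share is assumed to be whatever remains after the stated 71.2% urban and 9% rural shares. Only the classifier and its two hyperparameters come from the text, and the AUC this toy data produces will not match the reported 0.8734.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def make_applicants(n, shares, seed):
    """Hypothetical stand-in for generate_loan_applicants (not the article's code)."""
    rng = np.random.default_rng(seed)
    segment = rng.choice(3, size=n, p=shares)     # 0=urban, 1=suburban, 2=rural
    income = rng.normal(50 - 5 * segment, 12, n)  # invented feature, in thousands
    debt_ratio = rng.uniform(0.0, 0.8, n)         # invented feature
    # Invented label rule: a noisy linear score thresholded at zero.
    score = 0.06 * (income - 45) - 3.0 * (debt_ratio - 0.4) + rng.normal(0, 1, n)
    approved = (score > 0).astype(int)
    X = np.column_stack([segment, income, debt_ratio])
    return X, approved

# Skewed "historical" mix: urban and rural shares are from the article,
# the suburban share is the assumed remainder.
X, y = make_applicants(3000, [0.712, 0.198, 0.090], seed=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Hyperparameters as stated in the article.
model = GradientBoostingClassifier(n_estimators=100, max_depth=4, random_state=0)
model.fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"overall AUC: {auc:.3f}")
```

The single overall AUC printed here is exactly the kind of aggregate that can look fine while a small segment performs poorly.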

Evaluate by segment:

Segment	Historical (Biased)	Balanced Synthetic
Rural	AUC 0.791, Pred Approval 0.341 (true 0.412)	AUC 0.768, Pred Approval 0.334 (true 0.418)
Suburban	AUC 0.869, Pred Approval 0.468 (true 0.471)	AUC 0.852, Pred Approval 0.464 (true 0.469)
Urban	AUC 0.884, Pred Approval 0.521 (true 0.523)	AUC 0.889, Pred Approval 0.524 (true 0.521)

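With enough rural samples to measure it reliably, rural performance is weaker still (AUC 0.768 vs. 0.791). The model predicts approval for 33.4% of rural applicants when 41.8% truly qualify, so it under-approves qualified rural applicants.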
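A disaggregated evaluation of this kind can be sketched in plain Python. The function segment_report below is a hypothetical analogue of evaluate_by_segment, whose real signature the article does not show; the 0.5 decision threshold and the toy data are assumptions.

```python
from collections import defaultdict

def auc_score(y_true, y_score):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        return float("nan")  # AUC is undefined without both classes
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def segment_report(segments, y_true, y_score, threshold=0.5):
    """Per-segment sample count, AUC, predicted approval rate, and true approval rate."""
    groups = defaultdict(list)
    for seg, y, s in zip(segments, y_true, y_score):
        groups[seg].append((y, s))
    report = {}
    for seg, rows in groups.items():
        ys = [y for y, _ in rows]
        scores = [s for _, s in rows]
        report[seg] = {
            "n": len(rows),
            "auc": auc_score(ys, scores),
            "pred_approval": sum(s >= threshold for s in scores) / len(rows),
            "true_approval": sum(ys) / len(rows),
        }
    return report

# Toy data, invented for illustration only.
segments = ["urban"] * 4 + ["rural"] * 4
y_true = [1, 0, 1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.2, 0.7, 0.4, 0.6, 0.3, 0.4, 0.1]
report = segment_report(segments, y_true, y_score)
for seg, stats in report.items():
    print(seg, stats)
```

In the toy data the rural group shows both a lower AUC and a predicted approval rate below its true rate, the same pattern the table reports at scale. The pairwise AUC is quadratic in group size, which is fine for a sketch but slow for large segments.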

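The fairness audit computes disparate impact (DI), the ratio of each segment's predicted approval rate to that of the urban reference segment, and flags any value below 0.8, following the EEOC 80% (four-fifths) rule: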

Historical: Rural DI 0.654 (fail)
Balanced: Rural DI 0.641 (fail), suburban 0.891 (pass)
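As a worked example, DI for a segment is its predicted approval rate divided by the reference segment's rate. The sketch below applies that to the historical predicted approval rates from the table above; the function name disparate_impact is an assumption. Because the table values are rounded, the results land close to, but not necessarily exactly on, the reported DI figures.

```python
def disparate_impact(approval_rates, reference):
    """Ratio of each group's predicted approval rate to the reference group's rate."""
    ref_rate = approval_rates[reference]
    return {group: rate / ref_rate for group, rate in approval_rates.items()}

# Predicted approval rates from the historical column of the table (rounded).
historical = {"urban": 0.521, "suburban": 0.468, "rural": 0.341}

di = disparate_impact(historical, reference="urban")
for group, ratio in di.items():
    status = "pass" if ratio >= 0.8 else "fail"  # four-fifths threshold
    print(f"{group}: DI = {ratio:.3f} ({status})")
```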

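evaluate_by_segment and compute_fairness_metrics quantify the gaps. The Equalized Odds check compares true positive rates (TPR) across segments; in its full form it also compares false positive rates.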
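The Equalized Odds check can be sketched as per-group TPR and FPR followed by gaps against the reference segment. This is an illustrative stand-in, not the article's compute_fairness_metrics; the function names and toy labels are invented.

```python
def rates_by_group(groups, y_true, y_pred):
    """Per-group true positive rate (TPR) and false positive rate (FPR)."""
    counts = {}
    for g, y, p in zip(groups, y_true, y_pred):
        c = counts.setdefault(g, {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
        key = ("tp" if p else "fn") if y else ("fp" if p else "tn")
        c[key] += 1
    out = {}
    for g, c in counts.items():
        pos, neg = c["tp"] + c["fn"], c["fp"] + c["tn"]
        out[g] = {
            "tpr": c["tp"] / pos if pos else float("nan"),
            "fpr": c["fp"] / neg if neg else float("nan"),
        }
    return out

def equalized_odds_gaps(rates, reference):
    """Absolute TPR and FPR differences versus the reference group."""
    ref = rates[reference]
    return {
        g: {"tpr_gap": abs(r["tpr"] - ref["tpr"]), "fpr_gap": abs(r["fpr"] - ref["fpr"])}
        for g, r in rates.items() if g != reference
    }

# Toy labels and thresholded predictions, invented for illustration.
groups = ["urban"] * 6 + ["rural"] * 6
y_true = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
rates = rates_by_group(groups, y_true, y_pred)
gaps = equalized_odds_gaps(rates, reference="urban")
print(rates)
print(gaps)
```

A large TPR gap means qualified applicants in one segment are approved less often than equally qualified applicants in the reference segment, which is the failure mode the rural numbers point to.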

Retrain on Augmented Data to Achieve Fairness Without Sacrificing Accuracy

Combine the historical and balanced data, then retrain. Overall AUC drops only slightly, from 0.8734 to 0.8701, while rural DI rises to 0.812 (pass), so every segment now meets the 0.80 threshold.

Checklist for production:

Report AUC per segment
Report disaggregated prediction rates
Require DI ≥0.80 for every segment
Check Equalized Odds
Retrain if any check fails
Revalidate after retraining
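The checklist can be turned into an automated gate, sketched below. The function audit_gate is hypothetical. The 0.80 DI threshold comes from the article, while the minimum AUC and maximum TPR gap are assumed values that a team would need to set for itself. AUC and approval rates are the historical values from the article's table; the TPR values are invented placeholders.

```python
def audit_gate(metrics, reference, min_di=0.80, min_auc=0.75, max_tpr_gap=0.10):
    """Return a list of failed checks; an empty list means every check passed."""
    failures = []
    ref = metrics[reference]
    for group, m in metrics.items():
        if m["auc"] < min_auc:
            failures.append(f"{group}: AUC {m['auc']:.3f} < {min_auc}")
        di = m["pred_approval"] / ref["pred_approval"]
        if di < min_di:
            failures.append(f"{group}: DI {di:.3f} < {min_di}")
        gap = abs(m["tpr"] - ref["tpr"])
        if gap > max_tpr_gap:
            failures.append(f"{group}: TPR gap {gap:.3f} > {max_tpr_gap}")
    return failures

# AUC and pred_approval: historical column of the article's table.
# tpr: invented placeholder values.
metrics = {
    "urban": {"auc": 0.884, "pred_approval": 0.521, "tpr": 0.80},
    "suburban": {"auc": 0.869, "pred_approval": 0.468, "tpr": 0.78},
    "rural": {"auc": 0.791, "pred_approval": 0.341, "tpr": 0.62},
}
failures = audit_gate(metrics, reference="urban")
print("retrain and revalidate" if failures else "all checks passed")
for failure in failures:
    print(" -", failure)
```

Returning the list of failures rather than a single boolean keeps the report actionable: it names which segment failed which check before the retrain step.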

Controlling segment proportions synthetically gives each group enough samples for a well-powered audit (e.g., 1,255 rural samples); real data alone leaves estimates for small groups noisy. Test on balanced synthetic data first to catch bias before production.
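To make the sample-size point concrete, the sketch below computes the standard error of an approval rate for the two rural sample sizes mentioned in the article, using the usual binomial approximation. The choice of a 95% normal interval is an assumption for illustration.

```python
import math

def approval_rate_se(p, n):
    """Standard error of an approval rate p estimated from n independent samples."""
    return math.sqrt(p * (1 - p) / n)

p_rural = 0.412  # true rural approval rate reported in the article
se_small = approval_rate_se(p_rural, 138)   # rural rows in the historical data
se_large = approval_rate_se(p_rural, 1255)  # rural rows in the balanced data

for n, se in ((138, se_small), (1255, se_large)):
    print(f"n={n}: SE={se:.4f}, approx. 95% interval = +/-{1.96 * se:.3f}")
```

With 138 rows the interval is roughly ±0.08, about as wide as the 0.071 gap between predicted and true rural approval; with 1,255 rows it narrows to roughly ±0.03, so a gap of that size stands out clearly.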
