AUC 0.65 Perfectly Captures Noisy Bequest Signals

On 3.6% imbalanced synthetic donor data, untuned XGBoost delivers AUC 0.65, 47% recall (17 of 36 true positives), and 0.07 precision, roughly twice the 0.036 random baseline, while SHAP confirms long tenure, age 70+, and low recency as top drivers, validating faint real-world patterns amid intentional noise.

Prioritize Credibility Over Metrics in Imbalanced Classification

With only 181 confirmed bequest donors (a 3.6% minority class) in 5,000 records, skip hyperparameter tuning, which overfits unstable signals, and SMOTE, which invents synthetic positives atop an already artificial dataset, masking the true imbalance. Instead, use a stratified 80/20 train-test split to preserve the 3.6% positive rate in both halves (36 positives in test), scale only the numerics (frequency, monetary_value, recency, tenure) via StandardScaler while leaving one-hot dummies (age groups, rg_status) unscaled for interpretability, and set XGBoost's scale_pos_weight to the negative-to-positive ratio (4,819/181, roughly 27:1) for minority focus.
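The preprocessing steps above can be sketched as follows. The donor table here is a hypothetical stand-in (random lognormal numerics, 181 positive labels); the split, scaling, and weight calculation mirror the described setup.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 5_000

# Hypothetical stand-in for the synthetic donor table: 181 positives (3.6%)
# and four raw numeric features; one-hot dummies would be left unscaled.
y = np.zeros(n, dtype=int)
y[:181] = 1
rng.shuffle(y)
numerics = rng.lognormal(mean=3.0, sigma=1.0, size=(n, 4))  # frequency, monetary_value, recency, tenure

# Stratified 80/20 split keeps the 3.6% positive rate in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    numerics, y, test_size=0.2, stratify=y, random_state=0
)

# Fit the scaler on train only, then apply to test (no leakage).
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# XGBoost's scale_pos_weight = negatives / positives in the training fold.
spw = (y_train == 0).sum() / (y_train == 1).sum()
print(round(spw, 1), int(y_test.sum()))
```

The weight lands near 27, matching the 96.4%/3.6% class balance; it would be passed as `scale_pos_weight=spw` when constructing the classifier.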

The logistic regression baseline yields ROC-AUC 0.72 but zero positive-class precision and recall, defaulting to majority-class predictions (confusion matrix: [[964, 0], [36, 0]]). This exposes imbalance's pull toward safe, trivial accuracy (96%). XGBoost (n_estimators=100, learning_rate=0.1, max_depth=3) counters it, achieving ROC-AUC 0.65, precision 0.07 (7 of every 100 flagged are true positives, vs. the 0.036 random baseline), recall 0.47, and accuracy 0.74 (confusion matrix: [[720, 244], [19, 17]]). The 244 false positives are cheap: a single $50k bequest among the 17 true positives found more than justifies the mailing costs.
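The headline metrics follow directly from the two reported confusion matrices; a short check makes the arithmetic explicit (the reported 0.07 precision is 17/261 ≈ 0.065 rounded):

```python
import numpy as np

# Reported confusion matrices, rows = true class, columns = predicted class.
logit_cm = np.array([[964, 0], [36, 0]])     # baseline: never predicts positive
xgb_cm   = np.array([[720, 244], [19, 17]])  # XGBoost with scale_pos_weight

def summarize(cm):
    """Return (precision, recall, accuracy) for the positive class."""
    tn, fp, fn, tp = cm.ravel()
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / cm.sum()
    return precision, recall, accuracy

print(summarize(logit_cm))  # trivial 96.4% accuracy, zero precision/recall
print(summarize(xgb_cm))    # ~0.065 precision, ~0.47 recall, 0.737 accuracy
```

This is why accuracy alone is the wrong lens here: the do-nothing baseline scores 0.964 while finding zero bequest donors.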

SHAP Exposes Actionable Donor Drivers

SHAP values decompose individual predictions, revealing feature impacts: longer tenure pushes strongest toward bequest (top-ranked, with high values positive); age_70_or_over and age_60-69 follow positively (vs. the age_40-49 reference group); age_under_40 and age_50-59 push negatively. High recency (recent giving) and high monetary_value deter, pointing to a mid-value sweet spot, while higher frequency boosts propensity. rg_No_RG is weakly negative vs. active regular givers; rg_Cancelled stays muted despite its 1.2x propensity boost, as tenure and age dominate.
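The article's model uses TreeSHAP on XGBoost, but the additive decomposition SHAP guarantees is easiest to see in the linear case, where the exact SHAP value has a closed form: φᵢ(x) = wᵢ(xᵢ − E[xᵢ]), and the contributions sum to the prediction minus the average prediction. A minimal numpy illustration (toy weights and features, not the article's model):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))    # toy stand-ins for tenure, age, recency
w = np.array([0.8, 0.5, -0.6])    # tenure and age push up, recency pushes down
b = 0.1
f = X @ w + b                     # linear "model" scores

# Exact SHAP values for a linear model with independent features:
# phi_i(x) = w_i * (x_i - E[x_i]); they sum to f(x) - E[f(X)].
phi = w * (X - X.mean(axis=0))
base = f.mean()
recon = base + phi.sum(axis=1)
print(np.allclose(recon, f))      # additivity holds exactly
```

Tree models need the TreeSHAP algorithm instead of this closed form, but the same additivity property is what lets each donor's score be read as base rate plus per-feature pushes.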

The model reconstructs non-linear domain logic (the binned t_score and r_score derived from raw tenure and recency) through the noise, aligning with fundraising wisdom: lapsed mid-value loyalists outrank recent high-givers. AUC never reaches 1.0 by design: intentional stochastic assignment (propensity probability vs. np.random.rand()) and wildcard cases (high-propensity donors who never give, low-propensity donors who do) ensure class overlap, mimicking human unpredictability.
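The stochastic assignment plus wildcards described above might look roughly like this sketch; the beta-distributed propensity, the seed, and the flip counts are all illustrative assumptions, not the article's actual generator:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5_000

# Hypothetical propensity, skewed low so the mean sits near 3-4%.
propensity = rng.beta(1.2, 30.0, size=n)

# Stochastic assignment: a donor converts when a uniform draw falls below
# their propensity, so even high-propensity donors sometimes don't give.
label = (rng.random(n) < propensity).astype(int)

# Wildcards: force a few flips to mimic human unpredictability.
hi = np.argsort(propensity)[-20:]   # some high-propensity donors give nothing
lo = np.argsort(propensity)[:20]    # some low-propensity donors surprise
label[hi[:5]] = 0
label[lo[:5]] = 1
print(int(label.sum()))
```

Because the label is a coin flip weighted by propensity rather than a threshold on it, the positive and negative score distributions overlap by construction, capping the achievable AUC well below 1.0.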

Domain Knowledge Trumps Tools for Realistic Modeling

Synthetic realism stems from rules like the 80/20 Pareto distribution of donations, seasonal peaks (June/December), and lapsed donors outranking recent ones as prospects, not from Faker. Raw features force the model to infer the scored logic, paralleling real data that ships without internal 'loyalty scores'. AUC 0.65 admits faint signals (twice random precision, half the positives caught) without hype, enabling stewardship: target long-tenured, 60+, low-recency donors for legacy brochures. Next: probe retention via second-gift and cohort rates to gauge base health beyond lagging metrics.
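Two of those generation rules are easy to sketch with numpy; the shape parameter, base amount, and month weights below are illustrative guesses, not the article's values:

```python
import numpy as np

rng = np.random.default_rng(1)
n_gifts = 10_000

# Pareto-style amounts: shape ~1.16 is the classic 80/20 regime, so a
# small share of gifts carries most of the value.
amounts = (rng.pareto(a=1.16, size=n_gifts) + 1) * 25.0

# Seasonal peaks: zero-indexed months weighted so June (5) and
# December (11) dominate the giving calendar.
month_w = np.ones(12)
month_w[[5, 11]] = 4.0
months = rng.choice(12, size=n_gifts, p=month_w / month_w.sum())

top20_share = np.sort(amounts)[-n_gifts // 5:].sum() / amounts.sum()
print(round(top20_share, 2))   # share of value held by the top 20% of gifts
```

Rules like these bake real fundraising regularities into the data, which is what gives the model a faint but genuine signal to recover.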

Summarized by x-ai/grok-4.1-fast via openrouter


© 2026 Edge