Synthetically Label Sparse Bequest Donors Realistically
Engineer propensity scores from RFMT, age, and RG features using sector-specific bins (e.g., a recency sweet spot of 18-42 months scores 5 points), then add stochastic noise when creating 'Confirmed' labels, so that models trained on charity data with <1% positives cannot simply memorize the scoring formula.
Tackle Imbalanced Bequest Data with Synthetic Targets
Charity databases have <1% confirmed bequest donors (those who have formally notified the charity of their intent), even though >50% of bequest gifts come from lifetime strangers. Build a realistic target, bequest_status ('Confirmed' or NA), using a propensity formula on RFMT (recency/frequency/monetary/tenure), age groups, and regular giving (RG) status. Add controlled randomness via Bernoulli sampling on the propensity probability to mimic human variability and block model 'cheating': with deterministic labels, algorithms simply rediscover the exact formula, creating an echo chamber.
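The raw score is divided by a normalizer of ~357 to get a probability, giving values such as 0.089 for high scorers. Note that the peak scores listed (r=5, f=10, m=3, t=10, age=10x2=20) sum to 48, or 57.6 after the 1.2 RG multiplier, so 357 is larger than the formula's attainable maximum and keeps every donor's probability well below 1. This forces models to extract true signals amid noise, mirroring real sparse data.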
Engineer RFMT, Age, and RG Features from Transactions
Start with df_opps (opportunities) and df_contacts:
- RFMT: Group by contact_id; compute last_gift_date (max close_date), first_gift_date (min close_date), frequency (count of amount), and monetary_value (sum of amount). Then recency = months from last_gift_date to end_date (2025-12-31); tenure = months between first and last gift.
    def generate_rfmt(data):
        df = data.groupby('contact_id').agg({
            'close_date': ['max', 'min'],
            'amount': ['count', 'sum']
        })
        df.columns = ['last_gift_date', 'first_gift_date', 'frequency', 'monetary_value']
        # Convert to date, compute recency/tenure with relativedelta
        # ...
        return df.reset_index()
- Age groups: pd.cut(age, bins=[0,39,49,59,69,90], labels=['under_40','40-49','50-59','60-69','70_or_over']).
- RG status: Filter df_opps to rows where type == 'Regular'; get first_rg_date / last_rg_date per contact ID. If last_rg_date falls in 2025-12, the status is 'Active'; otherwise 'Cancelled'. Contacts with no RG rows become 'No RG' after the merge.
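
The three feature steps above can be sketched roughly as follows. This is a minimal illustration rather than the original code: the helper names add_recency_tenure, add_age_group, and rg_status and the toy rows are assumptions, and months are counted as whole calendar months instead of using the relativedelta call mentioned in the code comment. Only the column names, the 2025-12-31 end date, and the age bins come from the text.

```python
import pandas as pd

END_DATE = pd.Timestamp("2025-12-31")  # end date stated in the text


def add_recency_tenure(rfmt: pd.DataFrame) -> pd.DataFrame:
    """Add recency and tenure in whole calendar months (assumed convention)."""
    out = rfmt.copy()
    last = pd.to_datetime(out["last_gift_date"])
    first = pd.to_datetime(out["first_gift_date"])
    out["recency"] = (END_DATE.year - last.dt.year) * 12 + (END_DATE.month - last.dt.month)
    out["tenure"] = (last.dt.year - first.dt.year) * 12 + (last.dt.month - first.dt.month)
    return out


def add_age_group(contacts: pd.DataFrame) -> pd.DataFrame:
    """Bucket age with the bins and labels given in the text."""
    out = contacts.copy()
    out["age_group"] = pd.cut(
        out["age"],
        bins=[0, 39, 49, 59, 69, 90],
        labels=["under_40", "40-49", "50-59", "60-69", "70_or_over"],
    )
    return out


def rg_status(df_opps: pd.DataFrame) -> pd.DataFrame:
    """One row per contact with first/last regular gift dates and RG status."""
    rg = df_opps[df_opps["type"] == "Regular"]
    dates = rg.groupby("contact_id")["close_date"].agg(
        first_rg_date="min", last_rg_date="max"
    )
    last = pd.to_datetime(dates["last_rg_date"])
    # 'Active' only if the last regular gift falls in the end date's month
    is_active = (last.dt.year == END_DATE.year) & (last.dt.month == END_DATE.month)
    dates["rg_status"] = is_active.map({True: "Active", False: "Cancelled"})
    return dates.reset_index()


# Toy data (hypothetical) to exercise the helpers
rfmt_demo = add_recency_tenure(pd.DataFrame({
    "contact_id": [1],
    "last_gift_date": ["2024-06-10"],
    "first_gift_date": ["2020-06-25"],
}))
ages_demo = add_age_group(pd.DataFrame({"contact_id": [1, 2], "age": [39, 70]}))
opps_demo = pd.DataFrame({
    "contact_id": [1, 1, 2],
    "close_date": pd.to_datetime(["2020-01-15", "2025-12-05", "2023-06-30"]),
    "type": ["Regular", "Regular", "Regular"],
})
rg_demo = rg_status(opps_demo)
```

Note that pd.cut with these bins leaves ages above 90 (or missing ages) as NaN, so such rows would need separate handling before scoring.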
Right-merge the contacts onto the RFMT table (which drops contacts with no gift history), then left-merge the RG table, fill missing RG status with 'No RG', and drop extra columns such as name and gender.
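
A rough sketch of that merge sequence is shown below. The function name build_feature_table, the toy frames, and the exact column names name and gender are assumptions made for illustration; the merge directions and the 'No RG' fill follow the description above.

```python
import pandas as pd


def build_feature_table(df_contacts, rfmt, rg):
    """Combine contacts, RFMT, and RG status into one modelling table."""
    # Right merge keeps only contacts that have an RFMT row (some gift history)
    merged = df_contacts.merge(rfmt, on="contact_id", how="right")
    # Left merge keeps every remaining contact, with or without regular giving
    merged = merged.merge(rg[["contact_id", "rg_status"]], on="contact_id", how="left")
    merged["rg_status"] = merged["rg_status"].fillna("No RG")
    # Drop columns not used as features (assumed names)
    return merged.drop(columns=["name", "gender"], errors="ignore")


# Toy inputs (hypothetical)
contacts_demo = pd.DataFrame({
    "contact_id": [1, 2, 3],
    "name": ["A", "B", "C"],
    "gender": ["F", "M", "F"],
    "age": [72, 55, 40],
})
rfmt_table_demo = pd.DataFrame({
    "contact_id": [1, 2],
    "frequency": [12, 3],
    "monetary_value": [600.0, 90.0],
})
rg_table_demo = pd.DataFrame({"contact_id": [1], "rg_status": ["Active"]})

features_demo = build_feature_table(contacts_demo, rfmt_table_demo, rg_table_demo)
```

In this toy run, contact 3 has no gift history and is dropped, while contact 2 has gifts but no regular giving and is labelled 'No RG'.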
Sector-Tailored Scores Capture Counterintuitive Patterns
Assign 0-10 scores per feature, weighted for legacy giving realities (e.g., retired lapsed donors outscore active; mid-value > high-value):
| Feature | Bins/Logic | Labels | Rationale |
|---|---|---|---|
| Recency | [-1,18,42,84,1000] | 4,5,2,1 | 18-42mo 'sweet spot' for retired lapsed (highest); recent active lower; long dormant still viable. pd.cut. |
| Frequency | [-1,2,9,49,99,10000] | 0,1,4,7,10 | Frequency > value; 100+ 'Revolutionary'=10. pd.cut. |
| Monetary (quintiles) | pd.qcut(q=5, labels=[1,2,3,4,5]) → map {1:0,2:2,3:3,4:3,5:1} | Peak mid-quintiles | Mid-value (40-80%) most generous legacies; top 20% less confirmatory. |
| Tenure | pd.cut(bins=5) | 0,1,3,6,10 | Long tenure >> short; steep curve for loyalty. |
| Age | Map groups | {'under_40':0,'40-49':1,'50-59':3,'60-69':7,'70_or_over':10} | Exponential post-60; doubled in formula, not gated. |
| RG Weight (multiplier) | Map | {'Cancelled':1.2,'Active':1.0,'No RG':0.5} | Lapsed RG strong signal of estate shift. |
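
The scoring table can be turned into code along these lines. This is a sketch under assumptions: the function name score_features, the constant names, and the toy data are illustrative, the input columns are assumed to be named recency, frequency, monetary_value, tenure, age_group, and rg_status, and the oldest age key is written as '70_or_over' to match the label produced by the pd.cut age step. It also assumes no missing values, since casting binned categories to int fails on NaN.

```python
import pandas as pd

AGE_SCORES = {"under_40": 0, "40-49": 1, "50-59": 3, "60-69": 7, "70_or_over": 10}
RG_WEIGHTS = {"Cancelled": 1.2, "Active": 1.0, "No RG": 0.5}
MONETARY_SCORES = {1: 0, 2: 2, 3: 3, 4: 3, 5: 1}  # quintile -> score


def score_features(df: pd.DataFrame) -> pd.DataFrame:
    """Attach per-feature scores and the RG multiplier from the table."""
    out = df.copy()
    out["r_score"] = pd.cut(
        out["recency"], bins=[-1, 18, 42, 84, 1000], labels=[4, 5, 2, 1]
    ).astype(int)
    out["f_score"] = pd.cut(
        out["frequency"], bins=[-1, 2, 9, 49, 99, 10000], labels=[0, 1, 4, 7, 10]
    ).astype(int)
    # Quintiles first, then map so the middle quintiles score highest
    quintile = pd.qcut(out["monetary_value"], q=5, labels=[1, 2, 3, 4, 5]).astype(int)
    out["m_score"] = quintile.map(MONETARY_SCORES)
    # Five equal-width tenure bins over the observed range
    out["t_score"] = pd.cut(
        out["tenure"], bins=5, labels=[0, 1, 3, 6, 10]
    ).astype(int)
    out["age_score"] = out["age_group"].astype(str).map(AGE_SCORES)
    out["rg_weight"] = out["rg_status"].map(RG_WEIGHTS)
    return out


# Toy data (hypothetical): ten donors spanning every bin
demo = pd.DataFrame({
    "recency": [3, 18, 19, 42, 43, 84, 85, 200, 30, 10],
    "frequency": [1, 2, 3, 9, 10, 49, 50, 99, 100, 500],
    "monetary_value": [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
    "tenure": [0, 10, 20, 30, 40, 50, 60, 70, 80, 90],
    "age_group": ["under_40", "40-49", "50-59", "60-69", "70_or_over"] * 2,
    "rg_status": ["Cancelled", "Active", "No RG"] * 3 + ["Cancelled"],
})
scored = score_features(demo)
```

Because pd.qcut and pd.cut(bins=5) derive their edges from the data, the monetary and tenure scores depend on the distribution of the dataset being scored, not on fixed thresholds.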
Raw propensity = (r_score + f_score + m_score + t_score + 2*age_score) * rg_weight. E.g., high-freq recent-lapsed 70+: ~31.8 (prob 0.089); low everything: ~1 (prob 0.003).
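
As a worked illustration of the formula, the sketch below computes a raw propensity and its probability. The function names and the example donor's scores are hypothetical; the formula and the 357 divisor are the ones quoted in the text.

```python
NORMALIZER = 357.0  # divisor quoted in the text


def raw_propensity(r_score, f_score, m_score, t_score, age_score, rg_weight):
    """Sum of feature scores, with age counted twice, scaled by the RG weight."""
    return (r_score + f_score + m_score + t_score + 2 * age_score) * rg_weight


def assignment_prob(raw, normalizer=NORMALIZER):
    """Scale a raw propensity to a probability."""
    return raw / normalizer


# Hypothetical donor: recency sweet spot (5), 10-49 gifts (4), mid-value (3),
# fairly long tenure (6), aged 70 or over (10), cancelled regular giving (1.2)
example_raw = raw_propensity(5, 4, 3, 6, 10, 1.2)  # (5 + 4 + 3 + 6 + 20) * 1.2
example_prob = assignment_prob(example_raw)

# The text's own examples, rounded to three decimals
high_prob = round(assignment_prob(31.8), 3)
low_prob = round(assignment_prob(1.0), 3)
```

The same two functions work element-wise on pandas Series, so they can be applied directly to score columns in a DataFrame.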
Stochastic Assignment Mimics Real Donor Behavior
Convert raw_propensity to assignment_prob (e.g., /357 for 0-1 scale), then bequest_status = np.random.binomial(1, prob) → 'Confirmed' if 1. This injects noise: perfect scorers sometimes miss, low scorers occasionally confirm—breaking determinism so downstream classifiers learn generalizable patterns, not the formula.
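
A minimal sketch of the stochastic assignment follows. It swaps the global np.random.binomial call for a seeded NumPy Generator so that runs are reproducible; the function name assign_bequest_status and the seed parameter are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def assign_bequest_status(prob: pd.Series, seed=None) -> pd.Series:
    """Draw one Bernoulli trial per donor; 1 -> 'Confirmed', 0 -> missing."""
    rng = np.random.default_rng(seed)
    # binomial with n=1 is a Bernoulli draw; clip guards against p outside [0, 1]
    draws = rng.binomial(1, prob.clip(0, 1).to_numpy())
    status = pd.Series(draws, index=prob.index).map({1: "Confirmed"})
    return status.rename("bequest_status")


# Edge cases: probability 0 never confirms, probability 1 always does
edge = assign_bequest_status(pd.Series([0.0, 1.0]), seed=0)

# Many donors sharing the text's high-scorer probability of 0.089
many = assign_bequest_status(pd.Series([0.089] * 100_000), seed=1)
confirmed_share = many.notna().mean()
```

With a large sample the confirmed share lands close to the input probability, yet any individual high scorer can still come out unlabelled, which is the noise the labelling scheme relies on.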