Synthetically Label Sparse Bequest Donors Realistically

Tackle Imbalanced Bequest Data with Synthetic Targets

Charity databases typically contain fewer than 1% confirmed bequest donors (those who have formally notified the charity of their intent), even though more than 50% of gifts come from lifetime strangers. To work around this, build a realistic target, bequest_status ('Confirmed' or NA), from a propensity formula over RFMT (recency, frequency, monetary value, tenure), age group, and regular giving (RG) status. Then add controlled randomness by Bernoulli sampling on the propensity probability. This mimics human variability and blocks model 'cheating', where deterministic labels let an algorithm rediscover the exact formula and create an echo chamber.

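Raw propensity is divided by ~357 to get a probability, yielding values like 0.089 for high scorers. Note that the peak scores listed (r=5, f=10, m=3, t=10, age=10x2=20, times rg=1.2) give a maximum raw score of 48 x 1.2 = 57.6, so the ~357 divisor is not that sum; it keeps even top scorers well below certainty (57.6/357 ≈ 0.16). This forces models to extract true signals amid noise, mirroring real sparse data.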

Engineer RFMT, Age, and RG Features from Transactions

Start with df_opps (opportunities) and df_contacts:

RFMT: Group by contact_id and compute last_gift_date (max of close_date), first_gift_date (min of close_date), frequency (count of amount), and monetary_value (sum of amount). Then recency = months from last_gift_date to end_date (2025-12-31), and tenure = months between first and last gift.

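import pandas as pd

def generate_rfmt(data):
    # One row per contact: last/first gift date, gift count, total amount
    df = data.groupby('contact_id').agg({
        'close_date': ['max', 'min'],
        'amount': ['count', 'sum']
    })
    # Flatten the MultiIndex columns; the order follows the agg dict above
    df.columns = ['last_gift_date', 'first_gift_date', 'frequency', 'monetary_value']
    # Convert to date, compute recency/tenure with relativedelta
    # ...
    return df.reset_index()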

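Age groups: pd.cut(age, bins=[0,39,49,59,69,90], labels=['under_40','40-49','50-59','60-69','70_or_over']). Ages above 90 fall outside the last bin and come back as NaN unless the upper edge is raised.
RG status: Filter df_opps where type == 'Regular' and get first_rg_date and last_rg_date per contact_id. If last_rg_date falls in 2025-12, the status is 'Active'; otherwise it is 'Cancelled'. Contacts with no RG gifts become 'No RG' after the merge.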
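
As a rough sketch of these two steps, assuming an age column on the contacts frame and type, close_date, and contact_id columns on the opportunities frame (add_age_group and rg_status_per_contact are hypothetical names):

```python
import pandas as pd

AGE_BINS = [0, 39, 49, 59, 69, 90]
AGE_LABELS = ["under_40", "40-49", "50-59", "60-69", "70_or_over"]

def add_age_group(contacts):
    """Bin age into the five groups from the text (right-closed intervals)."""
    out = contacts.copy()
    # Ages outside (0, 90] become NaN with these bin edges
    out["age_group"] = pd.cut(out["age"], bins=AGE_BINS, labels=AGE_LABELS)
    return out

def rg_status_per_contact(opps, active_month="2025-12"):
    """First/last regular gift per contact, plus Active/Cancelled status."""
    rg = opps[opps["type"] == "Regular"].copy()
    rg["close_date"] = pd.to_datetime(rg["close_date"])
    per_id = (
        rg.groupby("contact_id")["close_date"]
        .agg(first_rg_date="min", last_rg_date="max")
        .reset_index()
    )
    # Active only if the most recent regular gift falls in the final month
    is_active = per_id["last_rg_date"].dt.strftime("%Y-%m") == active_month
    per_id["rg_status"] = is_active.map({True: "Active", False: "Cancelled"})
    return per_id

# Made-up rows, for illustration only
contacts = pd.DataFrame({"contact_id": ["A", "B", "C"], "age": [35, 64, 72]})
opps = pd.DataFrame({
    "contact_id": ["A", "A", "B", "C"],
    "type": ["Regular", "Regular", "Regular", "One-off"],
    "close_date": ["2025-11-05", "2025-12-05", "2024-03-01", "2025-01-01"],
})
with_age = add_age_group(contacts)
rg = rg_status_per_contact(opps)
```

Contact C has no regular gifts, so it is absent from rg here and only receives 'No RG' once the frames are merged.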

Merge right on RFMT (which drops contacts with no gift history), then left on RG; fill missing RG status with 'No RG'; and drop unused columns such as name and gender.
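
The merge order can be sketched as follows. This is a minimal illustration under assumed column names (contact_id as the join key, rg_status on the RG frame); build_feature_table is a hypothetical helper, not the original code.

```python
import pandas as pd

def build_feature_table(contacts, rfmt, rg):
    """Join contacts, RFMT, and RG status into one modelling table."""
    # Right merge on RFMT: contacts with no gift history drop out
    merged = contacts.merge(rfmt, on="contact_id", how="right")
    # Left merge on RG: donors without regular gifts stay, with NaN status
    merged = merged.merge(rg[["contact_id", "rg_status"]], on="contact_id", how="left")
    merged["rg_status"] = merged["rg_status"].fillna("No RG")
    # Drop columns that are not used as features
    return merged.drop(columns=["name", "gender"], errors="ignore")

# Made-up rows, for illustration only
contacts = pd.DataFrame({
    "contact_id": ["A", "B", "C"],
    "name": ["Ann", "Bob", "Cy"],
    "gender": ["F", "M", "M"],
    "age": [72, 64, 35],
})
rfmt = pd.DataFrame({"contact_id": ["A", "B"], "frequency": [12, 3]})
rg = pd.DataFrame({"contact_id": ["A"], "rg_status": ["Cancelled"]})
features = build_feature_table(contacts, rfmt, rg)
```

Contact C never gave, so the right merge removes it; contact B gave but has no regular gift, so it is labelled 'No RG'.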

Sector-Tailored Scores Capture Counterintuitive Patterns

Assign each feature a score (up to 10), weighted for legacy giving realities (e.g., retired lapsed donors outscore active donors; mid-value donors outscore high-value ones):

Feature	Bins/Logic	Labels	Rationale
Recency	`[-1,18,42,84,1000]`	4,5,2,1	18-42mo 'sweet spot' for retired lapsed (highest); recent active lower; long dormant still viable. `pd.cut`.
Frequency	`[-1,2,9,49,99,10000]`	0,1,4,7,10	Frequency > value; 100+ 'Revolutionary'=10. `pd.cut`.
Monetary (quintiles)	`pd.qcut(q=5, labels=[1,2,3,4,5])` → map `{1:0,2:2,3:3,4:3,5:1}`	Peak mid-quintiles	Mid-value (40-80%) most generous legacies; top 20% less confirmatory.
Tenure	`pd.cut(bins=5)`	0,1,3,6,10	Long tenure >> short; steep curve for loyalty.
Age	Map groups	{'under_40':0,'40-49':1,'50-59':3,'60-69':7,'70_or_over':10}	Exponential post-60; doubled in formula, not gated.
RG Weight (multiplier)	Map	{'Cancelled':1.2,'Active':1.0,'No RG':0.5}	Lapsed RG strong signal of estate shift.
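
The table translates fairly directly into pandas. The sketch below assumes the feature columns are named recency, frequency, monetary_value, tenure, age_group, and rg_status, keys the age map on the '70_or_over' label produced by the age binning, and uses score_features as a hypothetical function name.

```python
import pandas as pd

AGE_SCORES = {"under_40": 0, "40-49": 1, "50-59": 3, "60-69": 7, "70_or_over": 10}
RG_WEIGHTS = {"Cancelled": 1.2, "Active": 1.0, "No RG": 0.5}
M_SCORE_BY_QUINTILE = {1: 0, 2: 2, 3: 3, 4: 3, 5: 1}

def score_features(df):
    """Add the per-feature scores and the RG multiplier from the table."""
    out = df.copy()
    out["r_score"] = pd.cut(out["recency"], bins=[-1, 18, 42, 84, 1000],
                            labels=[4, 5, 2, 1]).astype(int)
    out["f_score"] = pd.cut(out["frequency"], bins=[-1, 2, 9, 49, 99, 10000],
                            labels=[0, 1, 4, 7, 10]).astype(int)
    # Quintiles first, then remap so the middle quintiles score highest
    quintile = pd.qcut(out["monetary_value"], q=5, labels=[1, 2, 3, 4, 5]).astype(int)
    out["m_score"] = quintile.map(M_SCORE_BY_QUINTILE)
    # Five equal-width tenure bins, scored on a steep curve
    out["t_score"] = pd.cut(out["tenure"], bins=5, labels=[0, 1, 3, 6, 10]).astype(int)
    out["age_score"] = out["age_group"].astype(str).map(AGE_SCORES)
    out["rg_weight"] = out["rg_status"].map(RG_WEIGHTS)
    return out

# Made-up rows, for illustration only
donors = pd.DataFrame({
    "recency": [6, 24, 60, 100, 30],
    "frequency": [1, 5, 20, 60, 150],
    "monetary_value": [10, 50, 200, 1000, 5000],
    "tenure": [0, 30, 60, 90, 120],
    "age_group": ["under_40", "40-49", "50-59", "60-69", "70_or_over"],
    "rg_status": ["No RG", "Active", "Cancelled", "Active", "Cancelled"],
})
scored = score_features(donors)
```

The .astype(int) calls fail on missing values, so out-of-range or null inputs would need handling before scoring in a real pipeline.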

Raw propensity = (r_score + f_score + m_score + t_score + 2*age_score) * rg_weight. E.g., high-freq recent-lapsed 70+: ~31.8 (prob 0.089); low everything: ~1 (prob 0.003).
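
To make the arithmetic concrete, here is a small sketch of the formula together with the 357 divisor used for the quoted probabilities. The function name and the SCALE constant are illustrative; the score inputs are whatever the scoring step produced.

```python
SCALE = 357.0  # divisor the text uses to turn raw propensity into a probability

def raw_propensity(r_score, f_score, m_score, t_score, age_score, rg_weight):
    """Formula from the text: age counts twice, the RG weight multiplies the sum."""
    return (r_score + f_score + m_score + t_score + 2 * age_score) * rg_weight

# Highest score reachable with the peak values from the table
peak = raw_propensity(5, 10, 3, 10, 10, 1.2)
# Lowest: long-dormant recency (1), zeros elsewhere, no regular giving (0.5)
floor = raw_propensity(1, 0, 0, 0, 0, 0.5)

# The two raw scores quoted in the text, converted to probabilities
text_probs = {raw: round(raw / SCALE, 3) for raw in (31.8, 1.0)}
```

With these inputs the raw score ranges from 0.5 to 57.6, and the two quoted raw scores map to 0.089 and 0.003 after dividing by 357.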

Stochastic Assignment Mimics Real Donor Behavior

Convert raw_propensity to assignment_prob (e.g., divide by 357 to get a 0-1 scale), then draw np.random.binomial(1, prob) and set bequest_status to 'Confirmed' when the draw is 1 (NA otherwise). This injects noise: perfect scorers sometimes miss and low scorers occasionally confirm. Breaking determinism this way means downstream classifiers learn generalizable patterns, not the formula.
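
A minimal sketch of the sampling step follows. It swaps the global np.random.binomial call for a seeded np.random.default_rng generator so that runs are reproducible; that swap, the clipping to [0, 1], and the name assign_bequest_status are assumptions of this example.

```python
import numpy as np
import pandas as pd

SCALE = 357.0  # normalizing constant quoted in the text

def assign_bequest_status(raw_propensity, scale=SCALE, seed=None):
    """Draw one Bernoulli trial per donor from the scaled propensity."""
    rng = np.random.default_rng(seed)
    prob = np.clip(np.asarray(raw_propensity, dtype=float) / scale, 0.0, 1.0)
    draws = rng.binomial(1, prob)  # 1 = confirmed, 0 = not confirmed
    # Unmapped zeros become NaN, matching the 'Confirmed' or NA target
    status = pd.Series(draws).map({1: "Confirmed"})
    return pd.DataFrame({"assignment_prob": prob, "bequest_status": status})

# 10,000 identical high scorers: only a fraction near 0.089 is confirmed
high = assign_bequest_status(np.full(10_000, 31.8), seed=42)
confirmed_rate = high["bequest_status"].notna().mean()
```

Because the label is a draw and not a threshold, two donors with identical features can receive different labels, which is what keeps a downstream model from recovering the formula exactly.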
