Synthetically Label Sparse Bequest Donors Realistically

Engineer RFMT-age-RG propensity scores with sector-specific bins (e.g., recency sweet spot 18-42mo=5pts) and stochastic noise to create 'Confirmed' labels, preventing models from overfitting formulas in <1% positive charity data.

Tackle Imbalanced Bequest Data with Synthetic Targets

Charity databases have <1% confirmed bequest donors (those who have formally notified the charity of their intent), even though >50% of gifts come from 'lifetime strangers'. Build a realistic target, bequest_status ('Confirmed' or NA), from a propensity formula over RFMT (recency, frequency, monetary value, tenure), age group, and regular giving (RG) status. Then add controlled randomness by Bernoulli sampling on the propensity probability. This mimics human variability and blocks model 'cheating': with deterministic labels, an algorithm simply rediscovers the exact formula, creating an echo chamber.

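The peak raw score from the scoring table is (5 + 10 + 3 + 10 + 2×10) × 1.2 = 57.6 (r=5, f=10, m=3, t=10, age=10 doubled, rg=1.2). Raw scores are divided by ~357 to obtain probabilities, yielding values like 0.089 for a high scorer (raw score ~31.8). This forces models to extract true signals amid noise, mirroring real sparse data.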

Engineer RFMT, Age, and RG Features from Transactions

Start with df_opps (opportunities) and df_contacts:

  • RFMT: Group by contact_id; compute last_gift_date (max close_date), first_gift_date (min), frequency (count amount), monetary_value (sum amount). Then recency = months since end_date (2025-12-31); tenure = months between first/last gift.
def generate_rfmt(data):
    df = data.groupby('contact_id').agg({
        'close_date': ['max', 'min'],
        'amount': ['count', 'sum']
    })
    df.columns = ['last_gift_date', 'first_gift_date', 'frequency', 'monetary_value']
    # Convert to date, compute recency/tenure with relativedelta
    # ...
    return df.reset_index()
  • Age groups: pd.cut(age, bins=[0,39,49,59,69,90], labels=['under_40','40-49','50-59','60-69','70_or_over']).
  • RG status: Filter df_opps[type=='Regular']; get first_rg_date/last_rg_date per ID. If last_rg_date in 2025-12: 'Active'; else 'Cancelled'. No RG → 'No RG' post-merge.
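The three steps above can be sketched roughly as follows. This is a minimal illustration, not the original author's code: the helper names (add_recency_tenure, add_age_group, rg_status) are hypothetical, the column names are taken from the text, and months are counted as a plain calendar-month difference rather than with relativedelta, which may differ by one month when the day of month matters.

```python
import numpy as np
import pandas as pd

END_DATE = pd.Timestamp("2025-12-31")  # snapshot date quoted in the text

def add_recency_tenure(rfmt, end_date=END_DATE):
    """Add recency and tenure in calendar months (day of month ignored)."""
    out = rfmt.copy()
    last = pd.to_datetime(out["last_gift_date"])
    first = pd.to_datetime(out["first_gift_date"])
    out["recency"] = (end_date.year - last.dt.year) * 12 + (end_date.month - last.dt.month)
    out["tenure"] = (last.dt.year - first.dt.year) * 12 + (last.dt.month - first.dt.month)
    return out

def add_age_group(contacts):
    """Bucket age with the bins from the text; ages above 90 become NaN."""
    out = contacts.copy()
    out["age_group"] = pd.cut(
        out["age"],
        bins=[0, 39, 49, 59, 69, 90],
        labels=["under_40", "40-49", "50-59", "60-69", "70_or_over"],
    )
    return out

def rg_status(df_opps, end_date=END_DATE):
    """Label each regular giver 'Active' or 'Cancelled' from their last RG date."""
    rg = df_opps[df_opps["type"] == "Regular"]
    dates = (
        rg.groupby("contact_id")["close_date"]
        .agg(first_rg_date="min", last_rg_date="max")
        .reset_index()
    )
    last = dates["last_rg_date"]
    # 'Active' means the last regular gift fell in the snapshot month (2025-12).
    active = (last.dt.year == end_date.year) & (last.dt.month == end_date.month)
    dates["rg_status"] = np.where(active, "Active", "Cancelled")
    return dates
```

Contacts with no regular gifts do not appear in the rg_status output at all; they only receive 'No RG' once the tables are merged.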

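Right-merge the contacts onto the RFMT table so that contacts with no gift history drop out, then left-merge the RG table and fill missing rg_status values with 'No RG'. Finally, drop unused columns such as name and gender.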
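One way the merge sequence might look in pandas is sketched below. The function name build_features is hypothetical, and the sketch assumes all three tables share a contact_id key and that the RG table carries an rg_status column.

```python
import pandas as pd

def build_features(df_contacts, rfmt, rg):
    """Combine contact, RFMT, and RG tables into one modelling frame."""
    # A right merge keeps every RFMT row, so contacts without gift history drop out.
    merged = df_contacts.merge(rfmt, on="contact_id", how="right")
    # A left merge keeps every donor, including those who never gave regularly.
    merged = merged.merge(rg[["contact_id", "rg_status"]], on="contact_id", how="left")
    merged["rg_status"] = merged["rg_status"].fillna("No RG")
    # Drop columns that are not used as features (ignored if absent).
    return merged.drop(columns=["name", "gender"], errors="ignore")
```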

Sector-Tailored Scores Capture Counterintuitive Patterns

Assign 0-10 scores per feature, weighted for legacy giving realities (e.g., retired lapsed donors outscore active; mid-value > high-value):

| Feature | Bins/Logic | Labels | Rationale |
|---|---|---|---|
| Recency | pd.cut, bins [-1,18,42,84,1000] | 4,5,2,1 | 18-42mo 'sweet spot' for retired lapsed (highest); recent active lower; long dormant still viable. |
| Frequency | pd.cut, bins [-1,2,9,49,99,10000] | 0,1,4,7,10 | Frequency > value; 100+ 'Revolutionary'=10. |
| Monetary (quintiles) | pd.qcut(q=5, labels=[1,2,3,4,5]) → map {1:0,2:2,3:3,4:3,5:1} | Peak mid-quintiles | Mid-value (40-80%) most generous legacies; top 20% less confirmatory. |
| Tenure | pd.cut(bins=5) | 0,1,3,6,10 | Long tenure >> short; steep curve for loyalty. |
| Age | Map groups | {'under_40':0,'40-49':1,'50-59':3,'60-69':7,'70_or_over':10} | Exponential post-60; doubled in formula, not gated. |
| RG Weight (multiplier) | Map | {'Cancelled':1.2,'Active':1.0,'No RG':0.5} | Lapsed RG strong signal of estate shift. |

Raw propensity = (r_score + f_score + m_score + t_score + 2*age_score) * rg_weight. E.g., high-freq recent-lapsed 70+: ~31.8 (prob 0.089); low everything: ~1 (prob 0.003).
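The scoring rules and the formula above could be wired together along these lines. This is a hedged sketch: the function name add_propensity and the input column names (recency, frequency, monetary_value, tenure, age_group, rg_status) are assumptions based on the text, and pd.qcut requires enough distinct monetary values to form five quintiles.

```python
import pandas as pd

RG_WEIGHT = {"Cancelled": 1.2, "Active": 1.0, "No RG": 0.5}
AGE_SCORE = {"under_40": 0, "40-49": 1, "50-59": 3, "60-69": 7, "70_or_over": 10}
MONETARY_SCORE = {1: 0, 2: 2, 3: 3, 4: 3, 5: 1}  # quintile -> score, peaking mid-range

# Peak raw score implied by the table: all features at their maximum, lapsed RG.
MAX_RAW = (5 + 10 + 3 + 10 + 2 * 10) * 1.2

def add_propensity(df):
    """Score each feature per the table, then combine into raw_propensity."""
    out = df.copy()
    out["r_score"] = pd.cut(
        out["recency"], bins=[-1, 18, 42, 84, 1000], labels=[4, 5, 2, 1]
    ).astype(float)
    out["f_score"] = pd.cut(
        out["frequency"], bins=[-1, 2, 9, 49, 99, 10000], labels=[0, 1, 4, 7, 10]
    ).astype(float)
    quintile = pd.qcut(out["monetary_value"], q=5, labels=[1, 2, 3, 4, 5]).astype(int)
    out["m_score"] = quintile.map(MONETARY_SCORE)
    # bins=5 splits the observed tenure range into five equal-width intervals.
    out["t_score"] = pd.cut(out["tenure"], bins=5, labels=[0, 1, 3, 6, 10]).astype(float)
    out["age_score"] = out["age_group"].astype(str).map(AGE_SCORE)
    out["rg_weight"] = out["rg_status"].map(RG_WEIGHT)
    out["raw_propensity"] = (
        out["r_score"] + out["f_score"] + out["m_score"]
        + out["t_score"] + 2 * out["age_score"]
    ) * out["rg_weight"]
    return out
```

Because the tenure and monetary bins are derived from the data itself, the same donor can score differently when the population changes; the recency and frequency bins are fixed.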

Stochastic Assignment Mimics Real Donor Behavior

Convert raw_propensity to assignment_prob (e.g., /357 for 0-1 scale), then bequest_status = np.random.binomial(1, prob) → 'Confirmed' if 1. This injects noise: perfect scorers sometimes miss, low scorers occasionally confirm—breaking determinism so downstream classifiers learn generalizable patterns, not the formula.
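A minimal sketch of the stochastic assignment follows. The function name assign_bequest_status and the seed parameter are illustrative additions; the sketch also swaps np.random.binomial for NumPy's Generator API (np.random.default_rng) so that runs can be reproduced, and it assumes a raw_propensity column already exists.

```python
import numpy as np
import pandas as pd

SCALE = 357  # divisor quoted in the text

def assign_bequest_status(df, scale=SCALE, seed=None):
    """Draw one Bernoulli trial per donor with p = raw_propensity / scale."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    out["assignment_prob"] = (out["raw_propensity"] / scale).clip(0, 1)
    draws = rng.binomial(1, out["assignment_prob"].to_numpy())
    # 1 -> 'Confirmed'; 0 has no mapping and therefore becomes NaN (the NA label).
    out["bequest_status"] = pd.Series(draws, index=out.index).map({1: "Confirmed"})
    return out
```

With a raw score of ~31.8 the probability is about 0.089, so roughly nine in a hundred such donors end up 'Confirmed' on average, and the exact set changes with the seed.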

Summarized by x-ai/grok-4.1-fast via openrouter

© 2026 Edge