Summaries · #data-science

DAY 01 · Yesterday · MAY 6, 2026 · 2 SUMMARIES

Learning Data · Marketing & Growth · May 6, 2026

Test Campaign Boosts Profit but Needs Funnel Fixes

The test campaign delivers higher revenue ($781,850 vs $758,050) and profit ($704,958 vs $691,232) with statistical significance (p ≈ 0) and double the CTR (10.2% vs 5.1%), but lower ROI (9.3 vs 10.6) and higher CAC ($4.92 vs $4.41). Scale it while fixing mid-funnel drop-offs.
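
The size of each trade-off is easier to see as a relative change. A minimal sketch using only the figures quoted above, treating the second number in each pair as the baseline campaign; the `relative_change` helper and the dictionary layout are illustrative and not part of the original analysis.

```python
# Metric pairs as (test, baseline), copied from the summary above.
metrics = {
    "revenue": (781_850, 758_050),
    "profit": (704_958, 691_232),
    "ctr_pct": (10.2, 5.1),
    "roi": (9.3, 10.6),
    "cac": (4.92, 4.41),
}


def relative_change(test, baseline):
    """Return the change of test relative to baseline, as a fraction."""
    return (test - baseline) / baseline


changes = {name: relative_change(t, b) for name, (t, b) in metrics.items()}

for name, change in changes.items():
    print(f"{name:8s} {change:+.1%}")
```

Revenue and profit rise by a few percent, while ROI falls and CAC rises by roughly 12% each, which is the efficiency cost the summary points to.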


Towards AI · Data Science & Visualization · May 6, 2026

Synthetic Data Exposes Hidden ML Bias Before Production

Real training data hides bias through underrepresentation (e.g., rural at 9%), proxy features, and skewed labels. Generate synthetic data with controlled segment shares (e.g., rural at 25%) to reveal it through disaggregated AUC drops (0.791 to 0.768) and disparate impact below 0.8, then retrain on mixed data to fix it.
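
The disparate impact check mentioned above can be sketched in a few lines. This is a minimal illustration, assuming binary predictions per segment; the prediction lists are hypothetical and the function names are not from the article. The 0.8 cutoff is the threshold quoted in the summary (often called the four-fifths rule).

```python
def selection_rate(predictions):
    """Fraction of positive (1) predictions in a group."""
    return sum(predictions) / len(predictions)


def disparate_impact(group_preds, reference_preds):
    """Ratio of the group's selection rate to the reference group's rate."""
    return selection_rate(group_preds) / selection_rate(reference_preds)


# Hypothetical binary predictions for two segments (illustrative only).
urban_preds = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]  # 7 of 10 positive
rural_preds = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1]  # 4 of 10 positive

ratio = disparate_impact(rural_preds, urban_preds)
flagged = ratio < 0.8  # threshold quoted in the summary
print(f"disparate impact = {ratio:.2f}, flagged = {flagged}")
```

Running the same check on both the real and the synthetic evaluation set is one way to see whether a gap only appears once the smaller segment has enough rows to measure.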

DAY 02 · Tuesday · MAY 5, 2026 · 1 SUMMARY

Towards AI · Data Science & Visualization · May 5, 2026

Track One User-Feature Pair to Catch ML Pipeline Bugs

A recommendation model with 0.91 AUC failed in production after 4 days because its user_30d_purchases features were 21 hours stale. Track user U-9842 and this feature through every pipeline layer to expose and prevent such mismatches.
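
Tracing one pair through the layers could look like the sketch below. It assumes each layer can report the feature's value and the time it was computed; the layer names, values, timestamps, and the 6-hour freshness limit are all hypothetical and chosen only to reproduce a 21-hour staleness case.

```python
from datetime import datetime, timedelta

# Hypothetical snapshots of user_30d_purchases for user U-9842 at each layer.
now = datetime(2026, 5, 5, 12, 0)
snapshots = [
    {"layer": "warehouse", "value": 7, "computed_at": now - timedelta(minutes=30)},
    {"layer": "feature_store", "value": 7, "computed_at": now - timedelta(hours=1)},
    {"layer": "serving_cache", "value": 5, "computed_at": now - timedelta(hours=21)},
]


def trace_feature(snapshots, now, max_age):
    """Flag layers whose value is too old or differs from the first layer."""
    reference = snapshots[0]["value"]
    issues = []
    for snap in snapshots:
        age = now - snap["computed_at"]
        if age > max_age:
            issues.append((snap["layer"], f"age {age} exceeds {max_age}"))
        if snap["value"] != reference:
            issues.append((snap["layer"], f"value {snap['value']} != {reference}"))
    return issues


issues = trace_feature(snapshots, now, max_age=timedelta(hours=6))
for layer, problem in issues:
    print(f"U-9842 user_30d_purchases @ {layer}: {problem}")
```

Checking both age and value matters: a stale feature can still match by coincidence, and a fresh one can still be wrong.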


DAY 03 · Monday · MAY 4, 2026 · 2 SUMMARIES

MarkTechPost · Data Science & Visualization · May 4, 2026

Production ML Pipelines with ZenML: Custom Materializers & HPO

ZenML enables end-to-end ML pipelines with custom DatasetBundle materializers for metadata-rich serialization, fan-out over 4 hyperparameter configs for RandomForest/GradientBoosting/LogisticRegression, fan-in best-model selection by ROC AUC, full artifact tracking, and cache-driven reproducibility on the breast cancer dataset.


Google Cloud Tech · AI & LLMs · May 4, 2026

Scale GenAI to Billions of Rows in BigQuery at 94% Less Cost

BigQuery's optimized mode distills LLMs into lightweight models using embeddings, slashing token use by 94% (55M to 3M) and query time from 16 min to 2 min on 34k images or 50k voice commands, scaling to billions of rows.

DAY 04 · Sunday · MAY 3, 2026 · 2 SUMMARIES

MarkTechPost · Data Science & Visualization · May 3, 2026

Stream Parse TaskTrove Dataset for AI Task Insights

Stream multi-GB TaskTrove dataset without full download; parse gzip-compressed tar/zip/JSON binaries to analyze sources, sizes (median p50 KB compressed), filenames, and detect verifiers for RL-ready tasks via multi-signal heuristics.
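
Sequential reading of a gzip-compressed tar without seeking or unpacking to disk can be sketched with Python's standard `tarfile` module. This is an illustration only: the sample archive, its file names, and the single filename-based verifier hint are hypothetical stand-ins, not the TaskTrove layout or the article's multi-signal heuristics.

```python
import io
import tarfile


def build_sample_archive():
    """Create a small in-memory .tar.gz so the sketch is self-contained."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, payload in [
            ("task/instruction.md", b"Solve it."),
            ("task/tests/verify.py", b"assert True\n"),
        ]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def scan_archive(stream):
    """Read members one by one; mode "r|gz" never seeks, so it suits streams."""
    names, total_bytes, has_verifier = [], 0, False
    with tarfile.open(fileobj=stream, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            names.append(member.name)
            total_bytes += member.size
            # Hypothetical single-signal heuristic: a filename hint.
            if "verify" in member.name or "test" in member.name:
                has_verifier = True
    return names, total_bytes, has_verifier


names, total_bytes, has_verifier = scan_archive(build_sample_archive())
print(names, total_bytes, has_verifier)
```

In a real run the `stream` argument would be a network response or other file-like object rather than an in-memory buffer.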


Data Driven Investor · Data Science & Visualization · May 3, 2026

Build Queryable Options IV DB from Live API Polls

Capture SpiderRock LiveImpliedQuote snapshots for TSLA every 10s into SQLite: append full history for audits (12k+ rows in 2min), upsert latest view per option_key. Query to reconstruct vol smiles and track ATM IV/skew changes over time.
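
The append-plus-upsert pattern can be sketched with the standard `sqlite3` module. The table names, columns, option key format, and quote values below are hypothetical and much simpler than a real LiveImpliedQuote record; the `ON CONFLICT ... DO UPDATE` clause assumes SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE iv_history (
    option_key TEXT, polled_at TEXT, iv REAL
);
CREATE TABLE iv_latest (
    option_key TEXT PRIMARY KEY, polled_at TEXT, iv REAL
);
""")


def store_snapshot(conn, option_key, polled_at, iv):
    """Append to the history table and upsert the latest row for option_key."""
    row = (option_key, polled_at, iv)
    conn.execute("INSERT INTO iv_history VALUES (?, ?, ?)", row)
    conn.execute(
        """INSERT INTO iv_latest VALUES (?, ?, ?)
           ON CONFLICT(option_key) DO UPDATE SET
               polled_at = excluded.polled_at,
               iv = excluded.iv""",
        row,
    )
    conn.commit()


# Two hypothetical polls, 10 seconds apart, for the same option.
store_snapshot(conn, "TSLA-2026-06-19-C-300", "2026-05-03T14:30:00", 0.52)
store_snapshot(conn, "TSLA-2026-06-19-C-300", "2026-05-03T14:30:10", 0.54)

history_rows = conn.execute("SELECT COUNT(*) FROM iv_history").fetchone()[0]
latest = conn.execute("SELECT polled_at, iv FROM iv_latest").fetchall()
print(history_rows, latest)
```

Keeping the two tables separate lets smile queries read a small, current table while audits replay the full history.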

DAY 05 · Saturday · MAY 2, 2026 · 2 SUMMARIES

MarkTechPost · AI & LLMs · May 2, 2026

Parse, Analyze, Visualize Hermes Agent Traces for Fine-Tuning

Extract thoughts/tool calls from Hermes agent dataset with regex parsers; compute stats like avg turns per trajectory, tool frequencies, error rates; visualize patterns; tokenize with assistant-only labels for SFT on Qwen models.
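
A regex-based parser for one assistant turn might look like the sketch below. It assumes thoughts are wrapped in `<think>` tags and tool calls are JSON inside `<tool_call>` tags; the summary does not state the dataset's exact markup, so both tag names and the sample turn are assumptions.

```python
import json
import re

# Assumed tag names; adjust to the dataset's actual markup.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_turn(text):
    """Return (thoughts, tool_calls) extracted from one assistant message."""
    thoughts = [t.strip() for t in THINK_RE.findall(text)]
    tool_calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            tool_calls.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # skip malformed payloads instead of failing the parse
    return thoughts, tool_calls


# Hypothetical assistant turn for illustration.
sample = (
    "<think>Need the weather first.</think>\n"
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Oslo"}}</tool_call>'
)
thoughts, tool_calls = parse_turn(sample)
tool_names = [call["name"] for call in tool_calls]
print(thoughts, tool_names)
```

Counting `tool_names` across all trajectories gives the tool-frequency statistic, and the skipped malformed payloads are one input to an error-rate figure.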


Data and Beyond · Data Science & Visualization · May 2, 2026

Data Science Splits: Engineer Pipelines or Lead Decisions

Data scientist roles are dividing into technical data engineering (SQL up 18%, ETL up 18%) and strategic decision-making. AI automates mid-level generalist tasks and squeezes the middle, so specialize in one side now.