Edge
№ 02 / SUMMARIES

#data-science

DAY 01 · Yesterday · MAY 6 · 2026 · 2 SUMMARIES
Learning Data · Marketing & Growth

Test Campaign Boosts Profit but Needs Funnel Fixes

The test campaign delivers higher revenue ($781,850 vs $758,050) and profit ($704,958 vs $691,232) with statistical significance (p ≈ 0) and a higher CTR (10.2% vs 5.1%), but ROI is lower (9.3 vs 10.6) and CAC is higher ($4.92 vs $4.41). Scale it while targeting mid-funnel drop-offs.

Towards AI · Data Science & Visualization

Synthetic Data Exposes Hidden ML Bias Before Production

Real training data hides bias through underrepresentation (e.g., rural at 9%), proxies, and skewed labels. Generate synthetic data with controlled segments (e.g., rural at 25%) to reveal it through disaggregated AUC drops (0.791 to 0.768) and disparate impact below 0.8, then retrain on mixed data to fix it.
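
The disparate impact check mentioned above can be sketched as a ratio of positive-outcome rates between two segments. This is a minimal illustration, not the article's code: the `disparate_impact` helper, the segment names, and the sample predictions are all hypothetical, and the 0.8 cutoff is the threshold quoted in the summary.

```python
from collections import defaultdict

def disparate_impact(records, protected, reference):
    """Ratio of positive-outcome rates: protected group / reference group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, approved in records:
        totals[group] += 1
        positives[group] += int(approved)
    return (positives[protected] / totals[protected]) / (
        positives[reference] / totals[reference]
    )

# Hypothetical model outputs as (segment, positive_prediction) pairs.
records = (
    [("rural", True)] * 30 + [("rural", False)] * 70
    + [("urban", True)] * 50 + [("urban", False)] * 50
)

ratio = disparate_impact(records, protected="rural", reference="urban")
flagged = ratio < 0.8  # threshold taken from the summary above
```

With these made-up counts the rural positive rate is 30% against 50% for urban, so the ratio falls below the 0.8 threshold and the segment is flagged.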

DAY 02 · Tuesday · MAY 5 · 2026 · 1 SUMMARY
Towards AI · Data Science & Visualization

Track One User-Feature Pair to Catch ML Pipeline Bugs

A recommendation model with 0.91 AUC failed in production after 4 days because its user_30d_purchases feature was 21 hours stale. Track user U-9842 and this feature through every pipeline layer to expose and prevent such mismatches.
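
One way to catch this kind of staleness is to compare each feature's computation time against a freshness budget. The sketch below assumes every feature row carries a `computed_at` timestamp; the `stale_features` helper, the 6-hour budget, and the `user_7d_views` feature are hypothetical, while `U-9842`, `user_30d_purchases`, and the 21-hour lag come from the summary.

```python
from datetime import datetime, timedelta, timezone

def stale_features(rows, now, max_age):
    """Return (user_id, feature, age) for every feature older than max_age."""
    return [
        (user_id, feature, now - computed_at)
        for user_id, feature, computed_at in rows
        if now - computed_at > max_age
    ]

now = datetime(2026, 5, 5, 12, 0, tzinfo=timezone.utc)
rows = [
    ("U-9842", "user_30d_purchases", now - timedelta(hours=21)),
    ("U-9842", "user_7d_views", now - timedelta(minutes=30)),  # hypothetical feature
]

# Assumed freshness budget of 6 hours; pick one that matches the serving SLA.
stale = stale_features(rows, now, max_age=timedelta(hours=6))
```

Running the same check at each pipeline layer for a single tracked user shows where the timestamp stops advancing.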

DAY 03 · Monday · MAY 4 · 2026 · 2 SUMMARIES
MarkTechPost · Data Science & Visualization

Production ML Pipelines with ZenML: Custom Materializers & HPO

ZenML enables end-to-end ML pipelines on the breast cancer dataset: custom DatasetBundle materializers for metadata-rich serialization, fan-out over 4 hyperparameter configs for RandomForest/GradientBoosting/LogisticRegression, fan-in best-model selection by ROC AUC, full artifact tracking, and cache-driven reproducibility.

Google Cloud Tech · AI & LLMs

Scale GenAI to Billions of Rows in BigQuery at 94% Less Cost

BigQuery's optimized mode distills LLMs into lightweight models using embeddings, cutting token use by 94% (55M to 3M) and query time from 16 min to 2 min on 34k images or 50k voice commands, and scaling to billions of rows.

DAY 04 · Sunday · MAY 3 · 2026 · 2 SUMMARIES
MarkTechPost · Data Science & Visualization

Stream Parse TaskTrove Dataset for AI Task Insights

Stream the multi-GB TaskTrove dataset without a full download; parse gzip-compressed tar/zip/JSON binaries to analyze sources, sizes (median p50 KB compressed), and filenames, and to detect verifiers for RL-ready tasks via multi-signal heuristics.
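
Sequential parsing of a gzip-compressed tar archive can be sketched with the standard library's `tarfile` stream mode, which reads members in order without seeking. This covers only the tar case, not zip or raw JSON; the `iter_members` helper and the in-memory sample archive are hypothetical stand-ins for a real network stream.

```python
import io
import tarfile

def iter_members(fileobj):
    """Yield (name, size) for regular files, reading the archive sequentially."""
    # "r|gz" is tarfile's stream mode: no random access is required.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as archive:
        for member in archive:
            if member.isfile():
                yield member.name, member.size

# Build a tiny in-memory archive to stand in for a streamed download.
payload = b'{"task": "demo"}'
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
    info = tarfile.TarInfo(name="task.json")
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))
buffer.seek(0)

members = list(iter_members(buffer))
```

Because stream mode never seeks backwards, the same function should work on any file-like object that supports `read`, such as an HTTP response body.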

Data Driven Investor · Data Science & Visualization

Build Queryable Options IV DB from Live API Polls

Capture SpiderRock LiveImpliedQuote snapshots for TSLA every 10 s into SQLite: append the full history for audits (12k+ rows in 2 min) and upsert the latest view per option_key. Query it to reconstruct vol smiles and track ATM IV and skew changes over time.
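
The append-plus-upsert pattern described above can be sketched with Python's built-in `sqlite3` module. The table names, columns, sample `option_key`, and IV values below are hypothetical and do not reflect the SpiderRock payload; only the `option_key` field name comes from the summary. The `ON CONFLICT ... DO UPDATE` clause needs SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE iv_history (
    option_key TEXT NOT NULL,
    polled_at  TEXT NOT NULL,
    iv         REAL NOT NULL
);
CREATE TABLE iv_latest (
    option_key TEXT PRIMARY KEY,
    polled_at  TEXT NOT NULL,
    iv         REAL NOT NULL
);
""")

def store_snapshot(conn, option_key, polled_at, iv):
    """Append to the audit history and upsert the latest-view row."""
    row = (option_key, polled_at, iv)
    with conn:  # one transaction for both writes
        conn.execute("INSERT INTO iv_history VALUES (?, ?, ?)", row)
        conn.execute(
            """INSERT INTO iv_latest VALUES (?, ?, ?)
               ON CONFLICT(option_key) DO UPDATE SET
                   polled_at = excluded.polled_at,
                   iv = excluded.iv""",
            row,
        )

# Two hypothetical polls, 10 seconds apart, for the same option.
key = "TSLA-2026-06-19-C-400"
store_snapshot(conn, key, "2026-05-03T14:30:00Z", 0.52)
store_snapshot(conn, key, "2026-05-03T14:30:10Z", 0.55)

history_count = conn.execute("SELECT COUNT(*) FROM iv_history").fetchone()[0]
latest = conn.execute("SELECT polled_at, iv FROM iv_latest").fetchall()
```

The history table grows by one row per poll, while the latest table holds exactly one row per option, which keeps smile reconstruction queries small.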

DAY 05 · Saturday · MAY 2 · 2026 · 2 SUMMARIES
MarkTechPost · AI & LLMs

Parse, Analyze, Visualize Hermes Agent Traces for Fine-Tuning

Extract thoughts and tool calls from the Hermes agent dataset with regex parsers; compute stats such as average turns per trajectory, tool frequencies, and error rates; visualize patterns; and tokenize with assistant-only labels for SFT on Qwen models.
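
A regex parser of the kind described might look like the sketch below. It assumes thoughts are wrapped in `<think>` tags and tool calls in `<tool_call>` tags holding a JSON object with a `name` field; that layout, the `parse_turn` helper, and the sample turn are assumptions for illustration, so check them against the actual dataset schema.

```python
import json
import re
from collections import Counter

# Assumed tag layout; adjust to the dataset's real markup.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def parse_turn(text):
    """Return (thoughts, tool_calls) extracted from one assistant turn."""
    thoughts = [t.strip() for t in THINK_RE.findall(text)]
    calls = []
    for raw in TOOL_RE.findall(text):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed payloads rather than guess
        if isinstance(parsed, dict) and "name" in parsed:
            calls.append(parsed)
    return thoughts, calls

# Hypothetical turn with one valid and one malformed tool call.
turn = (
    "<think>Need the weather first.</think>"
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
    "<tool_call>not json</tool_call>"
)

thoughts, calls = parse_turn(turn)
tool_counts = Counter(call["name"] for call in calls)
```

Counting the skipped payloads separately would give a rough malformed-call rate alongside the tool frequencies.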

Data and Beyond · Data Science & Visualization

Data Science Splits: Engineer Pipelines or Lead Decisions

Data scientist roles are dividing into technical data engineering (SQL up 18%, ETL up 18%) and strategic decision-making. AI automates mid-level generalist tasks, squeezing the middle, so specialize in one side now.