CATEGORY · 8 OF 11

Data Science & Visualization

Statistics and storytelling. Distributions, dashboards, charts that communicate, and the analysis discipline behind defensible product decisions.

67 SUMMARIES
+3 THIS WEEK
15 SOURCES
DAY 01 · Wednesday, June 17, 2026 · 1 SUMMARY
Python in Plain English · Data Science & Visualization

6 Habits That Elevate Data Science Projects Beyond Model Selection

Exceptional data science outcomes depend less on complex algorithms and more on disciplined fundamentals like data auditing, version control, and rigorous documentation.

DAY 02 · Monday, June 15, 2026 · 1 SUMMARY
Level Up Coding · Data Science & Visualization

Why Accuracy Metrics Hide ML Model Failures

High accuracy scores in automated systems like résumé classifiers often mask systemic biases and data quality issues that lead to unfair rejection patterns.
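A minimal sketch of the point, using made-up labels rather than anything from the article: overall accuracy can look strong while one group is almost always misclassified. The helper `accuracy_by_group` and the group names "A" and "B" are hypothetical.

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Return overall accuracy and a dict of per-group accuracy."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        totals[g] += 1
        hits[g] += int(t == p)
    overall = sum(hits.values()) / sum(totals.values())
    return overall, {g: hits[g] / totals[g] for g in totals}

# Hypothetical data: 90 candidates in group "A", 10 qualified candidates in group "B".
y_true = [1] * 45 + [0] * 45 + [1] * 10
y_pred = [1] * 45 + [0] * 45 + [0] * 8 + [1] * 2
groups = ["A"] * 90 + ["B"] * 10

overall, per_group = accuracy_by_group(y_true, y_pred, groups)
print(overall)    # 0.92
print(per_group)  # {'A': 1.0, 'B': 0.2}
```

Here 92% overall accuracy coexists with 8 of 10 qualified group "B" candidates being rejected, which only shows up once the metric is broken out by group.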

DAY 03 · June 13, 2026 · 1 SUMMARY
MarkTechPost · Data Science & Visualization

Spatial Graph Neural Networks for Urban Function Inference

A practical pipeline for urban function inference using city2graph, OSMnx, and PyTorch Geometric to classify POIs based on spatial relationships and graph topology.

DAY 04 · June 12, 2026 · 1 SUMMARY
MarkTechPost · Data Science & Visualization

Building 3D Medical Segmentation Pipelines with MONAI

This tutorial demonstrates an end-to-end 3D spleen segmentation pipeline using MONAI and a 3D UNet, covering data preprocessing, patch-based training, and sliding-window inference.

DAY 05 · June 8, 2026 · 1 SUMMARY
arXiv cs.AI · Data Science & Visualization

CrowdMath: A New Dataset for Mathematical Research Reasoning

CrowdMath is a new dataset derived from crowdsourced mathematical research discussions, designed to improve AI reasoning capabilities in complex, multi-step mathematical domains.

DAY 06 · June 6, 2026 · 2 SUMMARIES
MarkTechPost · Data Science & Visualization

Building a Semantic Search and Classifier for ResearchMath-14k

This tutorial demonstrates how to build a semantic search engine and status classifier for the ResearchMath-14k dataset using sentence embeddings, TF-IDF, and logistic regression.

Level Up Coding · Data Science & Visualization

Why Singular Value Decomposition Outperforms Eigen Decomposition

While eigenvectors identify stable directions in square matrices, Singular Value Decomposition (SVD) provides a more robust, universal framework for analyzing the rectangular matrices found in modern neural networks.
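As a small illustration with NumPy (the 4x3 random matrix is an arbitrary stand-in for a rectangular weight matrix, not an example from the article): eigendecomposition is rejected for a non-square input, while SVD factors it and ties back to the eigenvalues of the square matrix A.T @ A.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))  # rectangular, like a small weight matrix

# Eigendecomposition is only defined for square matrices.
try:
    np.linalg.eig(A)
    eig_ok = True
except np.linalg.LinAlgError:
    eig_ok = False

# SVD works for any shape: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_rebuilt = U @ np.diag(s) @ Vt

# The squared singular values are the eigenvalues of the square matrix A.T @ A.
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]

print(eig_ok, np.allclose(A, A_rebuilt), np.allclose(s ** 2, eigvals))
```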

DAY 07 · June 4, 2026 · 1 SUMMARY
Python in Plain English · Data Science & Visualization

Essential NumPy Concepts for Practical Data Science

Mastering eight core NumPy concepts—from vectorization to broadcasting—provides the foundation for 80% of daily data science tasks in Python.
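To make two of those concepts concrete, here is a short sketch (the tiny array is invented for illustration) that standardizes columns first with explicit loops and then with a vectorized, broadcast expression.

```python
import numpy as np

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])  # shape (3, 2)

# Loop version: standardize each column one element at a time.
Z_loop = np.empty_like(X)
for j in range(X.shape[1]):
    col = X[:, j]
    for i in range(X.shape[0]):
        Z_loop[i, j] = (col[i] - col.mean()) / col.std()

# Vectorized version: the per-column mean and std have shape (2,)
# and are broadcast across all three rows.
Z_vec = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(Z_loop, Z_vec))
```

Both versions produce the same result; the vectorized one pushes the loops into compiled code and reads closer to the math.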

DAY 08 · May 22, 2026 · 2 SUMMARIES
Python in Plain English · Data Science & Visualization

Predicting US Recessions with DTW and Boosted Trees

A framework for predicting economic cycles by using Dynamic Time Warping to align yield curve data, followed by boosted tree modeling and AWS containerized deployment.
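For readers unfamiliar with Dynamic Time Warping, a minimal textbook sketch follows. It is the classic dynamic-programming formulation with an absolute-difference cost, not necessarily the variant the article uses, and the two short sequences are invented stand-ins for yield-curve series.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping with absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]  # same shape, shifted one step earlier

pointwise = sum(abs(x - y) for x, y in zip(a, b))
print(pointwise, dtw_distance(a, b))  # 4 0.0
```

The point-by-point distance penalizes the one-step shift, while DTW aligns the two bumps and reports no difference in shape.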

Level Up Coding · Data Science & Visualization

Demystifying ML Math: From Vectors to Eigenvalues

Machine learning math is often obscured by intimidating terminology. In practice, the concepts are tools for structuring data, measuring change, and quantifying uncertainty in decision-making.

DAY 09 · May 18, 2026 · 1 SUMMARY
Level Up Coding · Data Science & Visualization

Mastering Step Plots in Matplotlib

Step plots are superior to standard line plots for visualizing incremental state changes, such as inventory levels, interest rates, or discrete signals, where transitions are abrupt rather than gradual.

DAY 10 · May 12, 2026 · 1 SUMMARY
MarkTechPost · Data Science & Visualization

skfolio: Build & Tune Portfolio Optimizers in Python

skfolio's scikit-learn API lets you construct, validate, and compare 18+ portfolio strategies—from baselines to HRP, Black-Litterman, factors, and tuned models—on S&P 500 returns with walk-forward CV and GridSearchCV.

DAY 11 · May 10, 2026 · 1 SUMMARY
Towards AI · Data Science & Visualization

Reproduce 2011 Sentiment Word Vectors in Python

Build sentiment-aware word embeddings from IMDb reviews by combining semantic learning with star ratings, then classify with a linear SVM, reproducing Maas et al. (2011). The article claims this simple method rivals modern LLMs.

DAY 12 · May 8, 2026 · 2 SUMMARIES
MarkTechPost · Data Science & Visualization

Scanpy Pipeline for PBMC scRNA-seq Clustering & Trajectories

Process PBMC-3k data with Scanpy: filter cells (min 200 genes, <2500 genes, <5% mt), remove Scrublet doublets, select HVGs (min_mean=0.0125, max_mean=3, min_disp=0.5), Leiden cluster at res=0.5, annotate via markers, infer PAGA/DPT trajectories, score IFN response.

AI Simplified in Plain English · Data Science & Visualization

NMI Bias Favors Complex Clusters Over Insight

Normalized Mutual Information (NMI) rewards over-segmentation and complexity in clustering, inflating scores for intuitively poor algorithms and distorting AI evaluations.
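The effect can be reproduced in a few lines. This sketch assumes the arithmetic-mean normalization of NMI (other normalizations exist, and the article may use a different one) and compares two equally uninformative clusterings of made-up labels: a random two-cluster split and a split into 200 singletons.

```python
import math
import random
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(truth, pred):
    """Mutual information normalized by the arithmetic mean of the two entropies."""
    n = len(truth)
    joint = Counter(zip(truth, pred))
    pt, pp = Counter(truth), Counter(pred)
    mi = sum((c / n) * math.log(c * n / (pt[t] * pp[p])) for (t, p), c in joint.items())
    denom = (entropy(truth) + entropy(pred)) / 2
    return mi / denom if denom > 0 else 0.0

random.seed(0)
truth = [0] * 100 + [1] * 100
random_two = [random.randint(0, 1) for _ in truth]  # uninformative, 2 clusters
singletons = list(range(len(truth)))                # uninformative, 200 clusters

print(round(nmi(truth, random_two), 3), round(nmi(truth, singletons), 3))
```

Neither clustering says anything useful about the true grouping, yet the singleton split scores about 0.23 (2·ln 2 / (ln 2 + ln 200)) while the random two-way split scores near zero.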

DAY 13 · May 7, 2026 · 3 SUMMARIES
Data and Beyond · Data Science & Visualization

Balance Linear Simplicity and Nonlinear Flexibility to Avoid Fit Failures

Linear models underfit nonlinear data because their decision boundaries are rigidly straight; highly flexible nonlinear models overfit by memorizing noise with wiggly curves. Choose model complexity by managing the bias-variance tradeoff so the model generalizes to unseen data.
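A rough sketch of the tradeoff, assuming a noisy sine curve as the data and polynomial degree as the complexity knob (both are illustrative choices, not taken from the article):

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

def sample(n):
    """Draw n noisy points from one period of a sine wave."""
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)

x_train, y_train = sample(30)
x_test, y_test = sample(200)

results = {}
for degree in (1, 3, 12):
    model = Polynomial.fit(x_train, y_train, degree)
    train_mse = float(np.mean((model(x_train) - y_train) ** 2))
    test_mse = float(np.mean((model(x_test) - y_test) ** 2))
    results[degree] = (train_mse, test_mse)
    print(degree, round(train_mse, 3), round(test_mse, 3))
```

Training error can only fall as the degree rises, so it cannot be used to pick the model. The straight line underfits on both sets, and the held-out error is the number to watch when deciding whether the extra flexibility of degree 12 helps or hurts.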

Towards AI · Data Science & Visualization

Time Series Fundamentals Before Modeling

Time series data depends on order—avoid shuffling or random splits. Decompose into trend, seasonality, cycles, noise; ensure stationarity (constant mean/variance/autocovariance) via differencing, logs, detrending; diagnose with ACF/PACF for AR/MA patterns.
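A standard-library sketch of three of these ideas, run on a simulated random walk with drift (the series, seed, and 80/20 cut are assumptions for illustration): a lag-1 autocorrelation check, first differencing, and a chronological split.

```python
import random

def autocorr(series, lag):
    """Sample autocorrelation at a given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return cov / var

random.seed(42)
# Hypothetical non-stationary series: a random walk with drift.
walk = [0.0]
for _ in range(499):
    walk.append(walk[-1] + 0.1 + random.gauss(0, 1))

# First differencing removes the stochastic trend.
diffed = [b - a for a, b in zip(walk, walk[1:])]

# Chronological split: every test point is later than every training point.
cut = int(len(diffed) * 0.8)
train, test = diffed[:cut], diffed[cut:]

print(round(autocorr(walk, 1), 2), round(autocorr(diffed, 1), 2))
```

The raw walk shows lag-1 autocorrelation close to 1, a typical sign of non-stationarity, while the differenced series sits near zero.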

Towards AI · Data Science & Visualization

Triple YOLO Recall with Adaptive Post-Processing

In crowded scenes, set YOLO confidence to 0.05, then filter dynamically by frame score distribution, box size (lower threshold for <5% height boxes), and pose keypoints (nose + shoulders) to detect 3x more people without retraining.

DAY 14 · May 6, 2026 · 1 SUMMARY
Towards AI · Data Science & Visualization

Synthetic Data Exposes Hidden ML Bias Before Production

Real training data hides bias via underrepresentation (e.g., rural at 9%), proxies, and skewed labels; generate synthetic data with controlled segments (e.g., rural at 25%) to reveal it through disaggregated AUC drops (0.791 to 0.768) and disparate impact <0.8, then retrain on mixed data to fix.
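The disparate impact check mentioned here is a ratio of selection rates. A minimal sketch follows, with invented approval decisions and a 25% rural share chosen to mirror the summary; the function names are hypothetical.

```python
def selection_rate(decisions):
    return sum(decisions) / len(decisions)

def disparate_impact(decisions, groups, protected, reference):
    """Ratio of the protected group's selection rate to the reference group's."""
    prot = [d for d, g in zip(decisions, groups) if g == protected]
    ref = [d for d, g in zip(decisions, groups) if g == reference]
    return selection_rate(prot) / selection_rate(ref)

# Hypothetical approvals (1 = approved): 75 urban and 25 rural applicants.
groups = ["urban"] * 75 + ["rural"] * 25
decisions = [1] * 45 + [0] * 30 + [1] * 10 + [0] * 15

ratio = disparate_impact(decisions, groups, protected="rural", reference="urban")
flagged = ratio < 0.8  # the common four-fifths threshold
print(round(ratio, 3), flagged)
```

With a 40% rural approval rate against 60% urban, the ratio is about 0.667 and falls below the 0.8 threshold.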

DAY 15 · May 5, 2026 · 2 SUMMARIES
MarkTechPost · Data Science & Visualization

Momentum Dampens GD Zigzags via Gradient Averaging

On anisotropic loss surfaces (condition number 100), vanilla GD zigzags and takes 185 steps to converge (loss <0.001); momentum with β=0.9 converges in 159 steps by canceling steep-direction oscillations while accelerating flat directions—but β=0.99 diverges.
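The comparison can be sketched on a quadratic with condition number 100. The learning rate, starting point, and heavy-ball form of momentum below are assumptions, since the summary does not state them, so the step counts will not match the article's 185 and 159, and this sketch does not attempt to reproduce the β=0.99 divergence.

```python
def steps_to_converge(lr, beta, tol=1e-3, max_steps=10_000):
    """Heavy-ball updates on f(x, y) = 0.5 * (x**2 + 100 * y**2); beta=0 is plain GD."""
    x, y = 10.0, 1.0  # assumed starting point
    vx, vy = 0.0, 0.0
    for step in range(1, max_steps + 1):
        gx, gy = x, 100.0 * y  # gradient of f
        vx = beta * vx - lr * gx
        vy = beta * vy - lr * gy
        x, y = x + vx, y + vy
        if 0.5 * (x * x + 100.0 * y * y) < tol:
            return step
    return None  # did not reach the tolerance

gd_steps = steps_to_converge(lr=0.018, beta=0.0)
momentum_steps = steps_to_converge(lr=0.018, beta=0.9)
print(gd_steps, momentum_steps)
```

With this learning rate, plain GD flips sign along the steep y axis on every step while crawling along the flat x axis; the velocity term averages out the sign flips and builds speed along x, so momentum reaches the tolerance in fewer steps.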

Towards AI · Data Science & Visualization

Track One User-Feature Pair to Catch ML Pipeline Bugs

A recommendation model with 0.91 AUC failed in production after 4 days because its user_30d_purchases feature was 21 hours stale. Track user U-9842 and this feature through every pipeline layer to expose and prevent such mismatches.

DAY 16 · May 4, 2026 · 1 SUMMARY
MarkTechPost · Data Science & Visualization

Production ML Pipelines with ZenML: Custom Materializers & HPO

ZenML enables end-to-end ML pipelines with custom DatasetBundle materializers for metadata-rich serialization, fan-out over 4 hyperparameter configs for RandomForest/GradientBoosting/LogisticRegression, fan-in best-model selection by ROC AUC, full artifact tracking, and cache-driven reproducibility on breast cancer dataset.

DAY 17 · May 3, 2026 · 2 SUMMARIES
MarkTechPost · Data Science & Visualization

Stream Parse TaskTrove Dataset for AI Task Insights

Stream multi-GB TaskTrove dataset without full download; parse gzip-compressed tar/zip/JSON binaries to analyze sources, sizes (median p50 KB compressed), filenames, and detect verifiers for RL-ready tasks via multi-signal heuristics.

Data Driven Investor · Data Science & Visualization

Build Queryable Options IV DB from Live API Polls

Capture SpiderRock LiveImpliedQuote snapshots for TSLA every 10s into SQLite: append full history for audits (12k+ rows in 2min), upsert latest view per option_key. Query to reconstruct vol smiles and track ATM IV/skew changes over time.
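The append-plus-upsert pattern can be sketched with the standard-library sqlite3 module. Only `option_key` comes from the summary; the table names, the other columns, the key format, and the IV values are hypothetical and do not reflect SpiderRock's actual schema. The API polling itself is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE iv_history (ts TEXT, option_key TEXT, iv REAL);
CREATE TABLE iv_latest  (option_key TEXT PRIMARY KEY, ts TEXT, iv REAL);
""")

def record_snapshot(conn, ts, option_key, iv):
    # Append-only history keeps every poll for later audits.
    conn.execute("INSERT INTO iv_history VALUES (?, ?, ?)", (ts, option_key, iv))
    # Upsert keeps exactly one current row per option_key.
    conn.execute(
        """INSERT INTO iv_latest (option_key, ts, iv) VALUES (?, ?, ?)
           ON CONFLICT(option_key) DO UPDATE SET ts = excluded.ts, iv = excluded.iv""",
        (option_key, ts, iv),
    )
    conn.commit()

# Hypothetical key format and values.
record_snapshot(conn, "2026-05-03T14:30:00", "TSLA-2026-06-19-C-400", 0.52)
record_snapshot(conn, "2026-05-03T14:30:10", "TSLA-2026-06-19-C-400", 0.55)

history_rows = conn.execute("SELECT COUNT(*) FROM iv_history").fetchone()[0]
latest = conn.execute("SELECT ts, iv FROM iv_latest").fetchall()
print(history_rows, latest)
```

After two polls of the same contract, the history table holds both rows and the latest view holds only the newer one. The `ON CONFLICT ... DO UPDATE` clause needs SQLite 3.24 or later.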

DAY 18 · May 2, 2026 · 1 SUMMARY
Data and Beyond · Data Science & Visualization

Data Science Splits: Engineer Pipelines or Lead Decisions

Data scientist roles are dividing into technical data engineering (SQL up 18%, ETL up 18%) and strategic decision-making; AI automates mid-level generalist tasks, squeezing the middle—specialize in one side now.

DAY 19 · May 1, 2026 · 2 SUMMARIES
Data and Beyond · Data Science & Visualization

Data And Beyond Grows to 49K Views, AI Topics Dominate

April 2026 stats: 49K views, 14.8K reads, +90 followers to 2K. Top stories cover Spark optimization, Claude AI leaks, clustering pitfalls, and RAG vs MCP.

Data and Beyond · Data Science & Visualization

Decompose Signals into Frequencies for Easier Analysis

Fourier transform breaks time-domain signals into frequency components, exposing periodic patterns buried in noise for filtering, compression, and fault detection—reversible and efficient via FFT.
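A short NumPy sketch of these uses, assuming a made-up 5 Hz tone sampled at 100 Hz and buried in Gaussian noise: find the dominant frequency, apply a crude low-pass filter, and confirm the transform is reversible.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 100                  # assumed sampling rate in Hz
t = np.arange(1000) / fs  # 10 seconds of samples
signal = np.sin(2 * np.pi * 5 * t) + rng.normal(scale=0.5, size=t.size)

# Real-input FFT and the frequency (in Hz) of each bin.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(signal.size, d=1 / fs)
peak_hz = freqs[np.argmax(np.abs(spectrum))]

# Crude low-pass filter: zero every component above 10 Hz, then invert.
filtered = np.fft.irfft(np.where(freqs <= 10, spectrum, 0), n=signal.size)

# With no components removed, the inverse transform recovers the input.
roundtrip = np.fft.irfft(spectrum, n=signal.size)

print(peak_hz, np.allclose(roundtrip, signal))
```

The periodic component is hard to see in the raw samples but stands out as a single dominant bin at 5 Hz in the spectrum.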

DAY 20 · April 29, 2026 · 1 SUMMARY
Learning Data · Data Science & Visualization

ETL Pipeline Turns Messy HR Data into Star Schema Insights

Build a scalable ETL pipeline to restructure flat HR data into a star schema of fact and dimension tables, enabling analysis of manager performance, diversity (60% White, 56.6% female), recruitment channels, and attrition prediction at 71% accuracy, where tenure drives 47% of decisions.

DAY 21 · April 21, 2026 · 1 SUMMARY
Learning Data · Data Science & Visualization

Automate Weekly PDF Reports with Python ETL Pipeline

Load/merge e-commerce datasets, compute revenue/profit/AOV/growth metrics, generate PDF with matplotlib/ReportLab charts and rule-based insights, email via smtplib, schedule weekly via GitHub Actions cron.

DAY 22 · April 20, 2026 · 1 SUMMARY
Level Up Coding · Data Science & Visualization

Preprocessing Swings CNN Accuracy from 65% to 87% on CIFAR-10

Raw CIFAR-10 pixels yield 65% test accuracy; normalization/standardization lift to 69%; geometric augmentation maintains ~67%; photometric brightness/contrast crashes to 20%; combined pipeline with deeper CNN hits 87%.

