Production ML Pipelines with ZenML: Custom Materializers & HPO

ZenML enables end-to-end ML pipelines on the breast cancer dataset: a custom DatasetBundle materializer for metadata-rich serialization, fan-out over four hyperparameter configs spanning RandomForest, GradientBoosting, and LogisticRegression, fan-in selection of the best model by ROC AUC, full artifact tracking, and cache-driven reproducibility.

Custom Materializers Enable Metadata-Rich Data Handling

Define DatasetBundle to encapsulate X, y, feature_names, and stats from sklearn's load_breast_cancer (569 samples, 30 features). Pair it with DatasetBundleMaterializer, which inherits from BaseMaterializer: save() stores X.npy, y.npy, and a meta.json holding feature_names/stats; load() reconstructs the bundle from those files; extract_metadata() computes n_samples, n_features, and class_distribution (e.g., {0: 212, 1: 357}). This auto-logs queryable metadata on the artifact and lets the domain object serialize cleanly without pickling issues, supporting ZenML's reproducibility guarantees.
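A minimal sketch of the pattern, assuming a recent ZenML release (BaseMaterializer with save/load/extract_metadata and zenml.io.fileio); field names beyond those mentioned above are illustrative:

```python
import json
import os
from dataclasses import dataclass
from typing import Any, Dict, List, Type

import numpy as np
from zenml.enums import ArtifactType
from zenml.io import fileio
from zenml.materializers.base_materializer import BaseMaterializer


@dataclass
class DatasetBundle:
    """Domain object bundling arrays with their descriptive metadata."""
    X: np.ndarray
    y: np.ndarray
    feature_names: List[str]
    stats: Dict[str, Any]


class DatasetBundleMaterializer(BaseMaterializer):
    ASSOCIATED_TYPES = (DatasetBundle,)
    ASSOCIATED_ARTIFACT_TYPE = ArtifactType.DATA

    def save(self, data: DatasetBundle) -> None:
        # Arrays go to .npy files, lightweight metadata to JSON.
        with fileio.open(os.path.join(self.uri, "X.npy"), "wb") as f:
            np.save(f, data.X)
        with fileio.open(os.path.join(self.uri, "y.npy"), "wb") as f:
            np.save(f, data.y)
        with fileio.open(os.path.join(self.uri, "meta.json"), "w") as f:
            json.dump({"feature_names": data.feature_names, "stats": data.stats}, f)

    def load(self, data_type: Type[DatasetBundle]) -> DatasetBundle:
        with fileio.open(os.path.join(self.uri, "X.npy"), "rb") as f:
            X = np.load(f)
        with fileio.open(os.path.join(self.uri, "y.npy"), "rb") as f:
            y = np.load(f)
        with fileio.open(os.path.join(self.uri, "meta.json"), "r") as f:
            meta = json.load(f)
        return DatasetBundle(X=X, y=y, **meta)

    def extract_metadata(self, data: DatasetBundle) -> Dict[str, Any]:
        # Auto-attached to the artifact version; queryable via the Client API.
        classes, counts = np.unique(data.y, return_counts=True)
        return {
            "n_samples": int(data.X.shape[0]),
            "n_features": int(data.X.shape[1]),
            "class_distribution": {int(c): int(n) for c, n in zip(classes, counts)},
        }
```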

Modular Steps Log Hyperparameters and Metrics at Every Stage

Use @step(enable_cache=True) throughout. load_data() returns Annotated[DatasetBundle, "raw_dataset"]. split_and_scale() performs a stratified train_test_split (default test_size=0.2), fits and applies StandardScaler, and logs train_size/test_size via log_metadata(). train_candidate() supports model_type="random_forest" | "gradient_boosting" | "logistic" with defaults n_estimators=100 and max_depth=5, fits on X_train/y_train, and logs model_type and the hyperparameters. evaluate_candidate() computes accuracy, f1, and roc_auc on X_test/y_test (using predict_proba where available) and logs all metrics under a label. These steps cache outputs, track lineage, and expose metadata for debugging and production monitoring.
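A hedged sketch of the two candidate steps, continuing from the DatasetBundle sketch above and assuming a ZenML version that exports log_metadata from the top-level package:

```python
import numpy as np
from sklearn.base import ClassifierMixin
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from typing_extensions import Annotated
from zenml import log_metadata, step


@step(enable_cache=True)
def train_candidate(
    X_train: np.ndarray,
    y_train: np.ndarray,
    model_type: str = "random_forest",
    n_estimators: int = 100,
    max_depth: int = 5,
) -> Annotated[ClassifierMixin, "model"]:
    if model_type == "random_forest":
        model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
    elif model_type == "gradient_boosting":
        model = GradientBoostingClassifier(n_estimators=n_estimators, max_depth=max_depth)
    else:
        # Tree hyperparameters do not apply to the logistic baseline.
        model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    log_metadata({
        "model_type": model_type,
        "hyperparameters": {"n_estimators": n_estimators, "max_depth": max_depth},
    })
    return model


@step(enable_cache=True)
def evaluate_candidate(
    model: ClassifierMixin,
    X_test: np.ndarray,
    y_test: np.ndarray,
    label: str,
) -> Annotated[dict, "metrics"]:
    preds = model.predict(X_test)
    # Prefer probability scores for ROC AUC when the model exposes them.
    scores = model.predict_proba(X_test)[:, 1] if hasattr(model, "predict_proba") else preds
    metrics = {
        "accuracy": float(accuracy_score(y_test, preds)),
        "f1": float(f1_score(y_test, preds)),
        "roc_auc": float(roc_auc_score(y_test, scores)),
    }
    log_metadata({"label": label, **metrics})
    return metrics
```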

Fan-Out HPO and Fan-In Selection Promote Best Model

SEARCH_SPACE defines 4 configs: two random_forest entries (n_estimators=50, max_depth=3 and n_estimators=200, max_depth=7), one gradient_boosting entry (n_estimators=100, max_depth=3), and one logistic entry (listed as 1/1, effectively placeholder values). training_pipeline(), decorated with @pipeline(model=PRODUCTION_MODEL), fans out: load_data → split_and_scale → a loop over train_candidate(id=f"train_{i}") and evaluate_candidate(id=f"eval_{i}", label=f"{model_type}(n={n_estimators},d={max_depth})"). Fan-in happens in select_best(): it picks the candidate with the highest ROC AUC, logs winning_metrics/chosen_candidate to the model's metadata, and returns production_model to the versioned breast_cancer_classifier (tags "tutorial", "advanced"). The fan-out generates 8 step runs (4 train + 4 eval), and promotion is automated via the Model Control Plane.
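A sketch of the fan-out/fan-in wiring under the same assumptions. train_candidate and evaluate_candidate are the steps sketched earlier; load_data and split_and_scale are assumed to exist as described. The fan-in reads sibling step outputs through the Client, following ZenML's documented fan-out/fan-in pattern (exact output-access shapes can vary between versions):

```python
from sklearn.base import ClassifierMixin
from typing_extensions import Annotated
from zenml import Model, get_step_context, log_metadata, pipeline, step
from zenml.client import Client

SEARCH_SPACE = [
    {"model_type": "random_forest", "n_estimators": 50, "max_depth": 3},
    {"model_type": "random_forest", "n_estimators": 200, "max_depth": 7},
    {"model_type": "gradient_boosting", "n_estimators": 100, "max_depth": 3},
    {"model_type": "logistic", "n_estimators": 1, "max_depth": 1},  # placeholders
]

PRODUCTION_MODEL = Model(name="breast_cancer_classifier", tags=["tutorial", "advanced"])


@step
def select_best(n_candidates: int) -> Annotated[ClassifierMixin, "production_model"]:
    # Fan-in: read the sibling eval_*/train_* outputs of the current run
    # through the Client, then promote the candidate with the best ROC AUC.
    run = Client().get_pipeline_run(get_step_context().pipeline_run.name)
    results = [
        (i, run.steps[f"eval_{i}"].outputs["metrics"][0].load())
        for i in range(n_candidates)
    ]
    best_i, best_metrics = max(results, key=lambda r: r[1]["roc_auc"])
    # Attach the decision to the model version in the Model Control Plane.
    log_metadata(
        {"chosen_candidate": best_i, "winning_metrics": best_metrics},
        infer_model=True,
    )
    return run.steps[f"train_{best_i}"].outputs["model"][0].load()


@pipeline(model=PRODUCTION_MODEL)
def training_pipeline():
    bundle = load_data()
    X_train, X_test, y_train, y_test = split_and_scale(bundle)
    evals = []
    for i, cfg in enumerate(SEARCH_SPACE):
        model = train_candidate(X_train, y_train, id=f"train_{i}", **cfg)
        label = f"{cfg['model_type']}(n={cfg['n_estimators']},d={cfg['max_depth']})"
        evals.append(evaluate_candidate(model, X_test, y_test, label=label, id=f"eval_{i}"))
    # `after=` makes select_best wait for every evaluation before fanning in.
    select_best(n_candidates=len(SEARCH_SPACE), after=evals)
```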

Client API Ensures Inspection, Caching, and Zero-Recompute Reruns

Post-run, Client().get_pipeline_run() shows status, step counts (e.g., 9 steps), and aggregated metadata. get_model_version("latest") reveals version.number, linked artifacts, and run_metadata (e.g., chosen_candidate). Reload the winner via prod_model = get_artifact_version("production_model").load() and verify accuracy_score on the stored X_test/y_test. The raw_dataset metadata includes n_samples=569, n_features=30, and class_distribution. A rerun hits the cache (enable_cache=True) and skips all recomputation. list_pipeline_runs(), list_model_versions(), and list_artifact_versions() support querying; the full notebook on GitHub demonstrates reproducibility without redundant work.
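A post-run inspection sketch with the Client API; exact attribute paths (e.g., run_metadata) can vary between ZenML versions:

```python
from sklearn.metrics import accuracy_score
from zenml.client import Client

client = Client()

# Most recent run: status, step count, aggregated metadata.
run = client.list_pipeline_runs(sort_by="desc:created", size=1).items[0]
print(run.status, len(run.steps))

# Latest model version plus the metadata logged by select_best.
mv = client.get_model_version("breast_cancer_classifier", "latest")
print(mv.number, mv.run_metadata.get("chosen_candidate"))

# Reload the promoted model and the stored test split, then sanity-check it.
prod_model = client.get_artifact_version("production_model").load()
X_test = client.get_artifact_version("X_test").load()
y_test = client.get_artifact_version("y_test").load()
print("accuracy:", accuracy_score(y_test, prod_model.predict(X_test)))

# Metadata auto-extracted by the custom materializer.
print(client.get_artifact_version("raw_dataset").run_metadata)
```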
