Rigorous QC and Filtering Remove Noise for Reliable Downstream Analysis
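Load PBMC-3k via sc.datasets.pbmc3k() (2,700 cells by 32,738 genes before filtering). Flag mitochondrial genes (MT- prefix) and ribosomal genes (RPS/RPL prefixes) in adata.var, then pass those flags as qc_vars to sc.pp.calculate_qc_metrics to obtain n_genes_by_counts, total_counts, and pct_counts_mt per cell. Keep cells with pct_counts_mt < 5. Visualize the three metrics with violin plots, and use scatter plots (e.g., total_counts against n_genes_by_counts or pct_counts_mt) to spot outliers.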
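Filter cells with sc.pp.filter_cells(min_genes=200) and genes with sc.pp.filter_genes(min_cells=3), and keep only cells with n_genes_by_counts < 2500. Detect doublets with sc.pp.scrublet, which adds doublet_score and predicted_doublet to adata.obs; dropping the flagged cells removes predicted_doublet.sum() cells. Preserve the raw counts in adata.layers["counts"] before normalization. The cleaner matrix reduces artifacts from low-quality cells and doublets in clustering.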
Normalization, HVGs, and Cell-Cycle Correction Focus on Biological Signal
Normalize each cell to 10,000 counts (sc.pp.normalize_total(target_sum=1e4)) and log-transform (sc.pp.log1p). Identify highly variable genes (sc.pp.highly_variable_genes(min_mean=0.0125, max_mean=3, min_disp=0.5)). Store the full log-normalized matrix in adata.raw before subsetting, because adata.raw assigned after subsetting would hold only the HVGs; then subset to the HVGs (adata = adata[:, adata.var.highly_variable]).
Score S and G2M phases with sc.tl.score_genes_cell_cycle using 40+ predefined markers (e.g., S: MCM5, PCNA; G2M: HMGB2, CDK1), keeping only the markers present in the dataset. Regress out total_counts and pct_counts_mt (sc.pp.regress_out), then scale each gene to unit variance and clip values at 10 (sc.pp.scale(max_value=10)). These steps reduce technical variance so that downstream models capture mostly biological signal.
Dimensionality Reduction, Leiden Clustering, and Marker-Based Annotation Reveal Cell Types
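Run PCA (sc.tl.pca(svd_solver="arpack")) and inspect the variance explained by the first 50 components (sc.pl.pca_variance_ratio(n_pcs=50)). Build the neighbor graph (sc.pp.neighbors(n_neighbors=10, n_pcs=40)). Compute embeddings: UMAP (sc.tl.umap) and t-SNE (sc.tl.tsne(n_pcs=40)).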
Cluster with Leiden (sc.tl.leiden(resolution=0.5, flavor="igraph", n_iterations=2)). Rank marker genes per cluster with the Wilcoxon rank-sum test (sc.tl.rank_genes_groups(groupby="leiden", method="wilcoxon")) and inspect the top 10 per cluster. Annotate using PBMC markers: B cell (CD79A, MS4A1), CD8 T (CD8A, CD8B), CD4 T (IL7R, CD4), NK (GNLY, NKG7), CD14 Mono (CD14, LYZ), FCGR3A Mono (FCGR3A, MS4A7), Dendritic (FCER1A, CST3), Mega (PPBP). Confirm with sc.pl.dotplot and sc.pl.stacked_violin, both with groupby="leiden". The result is 8-9 clusters that match the expected immune subsets.
PAGA Trajectories, Pseudotime, and Custom Scores Enable Developmental Insights
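Graph-based trajectories: run sc.tl.paga(groups="leiden"), plot the abstracted graph with sc.pl.paga(threshold=0.1), and use the plotted node positions to initialize UMAP (sc.tl.umap(init_pos="paga")). For pseudotime, compute diffusion maps (sc.tl.diffmap), recompute neighbors on X_diffmap (sc.pp.neighbors(use_rep="X_diffmap")), set the root to a cell in cluster 0 (adata.uns["iroot"] takes the integer index of one cell, not a cluster label), and run sc.tl.dpt. Plot dpt_pseudotime on the UMAP.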
Custom score: score IFN-response genes (ISG15, IFI6, IFIT1, IFIT3, MX1, OAS1, STAT1, IRF7) with sc.tl.score_genes(score_name="IFN_score") and plot the score on the UMAP with cmap="viridis". Save the full AnnData (adata.write("pbmc3k_analyzed.h5ad")) with embeddings, clusters, and scores for reuse. Together these steps extend basic clustering toward inferring progression and response states.