Pearson's r: Quantifying Linear Correlations Precisely

Pearson's correlation coefficient (r) normalizes covariance to measure the strength and direction of linear association between two variables. It ranges from -1 (perfect negative) to +1 (perfect positive) and is unitless, which makes it comparable across datasets.

Formula and Computation for Populations and Samples

Pearson's ρ (population) or r (sample) is covariance divided by the product of standard deviations: ρ_{X,Y} = cov(X,Y) / (σ_X σ_Y). Covariance expands to E[(X - μ_X)(Y - μ_Y)], making r the cosine of the angle between the mean-centered data vectors: 1 when they point in the same direction, 0 when orthogonal, -1 when opposite.

For samples of n pairs (x_i, y_i), r = Σ(x_i - x̄)(y_i - ȳ) / (√(Σ(x_i - x̄)²) √(Σ(y_i - ȳ)²)); any 1/(n-1) factors from the sample variances cancel in the ratio. Computationally, center the data (subtract the means), then r equals the dot product of the centered vectors divided by the product of their magnitudes. This vector view reveals why r ignores scale: it is invariant to positive linear transformations (aX + b, cY + d with a, c > 0).
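
A minimal NumPy sketch of this centered-dot-product view (the helper name pearson_r and the sample data are illustrative):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r as the cosine between mean-centered data vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()          # center: subtract the mean
    yc = y - y.mean()
    # dot product of centered vectors over the product of their norms
    return xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
print(pearson_r(x, y))  # close to 1: strong positive linear trend
```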

"The correlation coefficient can be derived by shifting the x and y data values so they each have zero average... and computing the cosine between these two vector directions."

Practical tip: use libraries like NumPy's np.corrcoef(x, y)[0, 1] or SciPy's pearsonr(x, y), which also returns a p-value; inspect and preprocess outliers first, as they inflate the covariance and variances disproportionately.
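
For instance, both library calls agree with the manual version above (the generated data are illustrative):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(scale=0.8, size=100)

r_matrix = np.corrcoef(x, y)        # 2x2 correlation matrix
r = r_matrix[0, 1]                  # off-diagonal entry is r
r_scipy, p_value = pearsonr(x, y)   # also returns a two-sided p-value
print(r, r_scipy, p_value)
```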

Interpretation: Strength, Direction, and Visual Geometry

r > 0 signals a positive linear trend (the variables rise and fall together), r < 0 a negative one; |r| near 1 is strong, near 0 weak or no linear link. Unlike a regression slope, r is standardized: steep and shallow lines can yield the same r if the relative dispersions match. Scatterplots make this concrete: example sets typically show r tracking linear strength, different slopes sharing the same r, and nonlinear patterns (e.g., quadratic) with r near 0 despite clear structure.

Geometrically, r = cov(X,Y) / (σ_X σ_Y): the covariance normalized by both spreads. Since the least-squares slope of Y on X is b = r σ_Y/σ_X, r equals that slope times σ_X/σ_Y.
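
A quick numerical check of this slope relation, assuming a least-squares fit of y on x (the data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(size=1000)

b = np.polyfit(x, y, 1)[0]          # least-squares slope of y on x
r = np.corrcoef(x, y)[0, 1]
# b = r * (sigma_y / sigma_x), so r = b * (sigma_x / sigma_y)
print(r, b * x.std() / y.std())     # the two values agree
```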

Size guide (per Cohen): |r| 0.00-0.10 negligible, 0.10-0.30 small, 0.30-0.50 medium, ≥0.50 large. But context matters: r = 0.5 is substantial in psychometrics yet would be considered weak in physics.

"A key difference is that unlike covariance, this correlation coefficient does not have units, allowing comparison of the strength of the joint association between different pairs of random variables."

Inference: Testing Significance and Confidence

Null hypothesis: ρ = 0 (no linear correlation). For confidence intervals, the Fisher transform z = artanh(r) = 0.5 ln((1+r)/(1-r)) is approximately normal with mean artanh(ρ) and standard error 1/√(n-3). For testing, t = r √((n-2)/(1-r²)) follows a t distribution with n-2 degrees of freedom under the null; the p-value comes from its CDF.
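
A sketch of both procedures, assuming a two-sided test (the helper r_inference is illustrative):

```python
import numpy as np
from scipy.stats import t as t_dist, norm

def r_inference(r, n, alpha=0.05):
    """t-test p-value and Fisher-transform CI for Pearson's r."""
    # t statistic with n - 2 degrees of freedom
    t_stat = r * np.sqrt((n - 2) / (1 - r**2))
    p = 2 * t_dist.sf(abs(t_stat), df=n - 2)
    # Fisher z: artanh(r) is approximately normal, SE = 1/sqrt(n - 3)
    z = np.arctanh(r)
    se = 1 / np.sqrt(n - 3)
    zc = norm.ppf(1 - alpha / 2)
    lo, hi = np.tanh(z - zc * se), np.tanh(z + zc * se)
    return p, (lo, hi)

print(r_inference(r=0.45, n=50))  # p-value and 95% CI for rho
```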

Nonparametric options: a permutation test shuffles y, recomputes r (e.g., 10,000 times), and checks how extreme the observed r is. The bootstrap resamples (x, y) pairs with replacement to get a distribution of r for a CI (e.g., the 2.5th and 97.5th percentiles).
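
A sketch of both resampling schemes (counts and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=40)
y = 0.4 * x + rng.normal(size=40)
r_obs = np.corrcoef(x, y)[0, 1]

# Permutation test: shuffle y to break any association, recompute r
perm = np.array([np.corrcoef(x, rng.permutation(y))[0, 1]
                 for _ in range(10_000)])
p_perm = np.mean(np.abs(perm) >= abs(r_obs))  # two-sided p-value

# Bootstrap CI: resample (x, y) pairs with replacement
idx = rng.integers(0, len(x), size=(10_000, len(x)))
boot = np.array([np.corrcoef(x[i], y[i])[0, 1] for i in idx])
ci = np.percentile(boot, [2.5, 97.5])
print(r_obs, p_perm, ci)
```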

Exact p-values for small n exist (the exact distribution of r under bivariate normality involves the Gaussian hypergeometric function), but the Fisher transform is usually preferred because it corrects the skew of r's sampling distribution. The standard error of r is ≈ 1/√n for r near 0. Power analysis: to detect ρ = 0.3, n = 85 yields 80% power at α = 0.05.
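
The n = 85 figure can be reproduced from the Fisher approximation; this is a back-of-the-envelope sketch, not an exact power calculation (the helper n_for_correlation is illustrative):

```python
import numpy as np
from scipy.stats import norm

def n_for_correlation(rho, alpha=0.05, power=0.80):
    """Approximate sample size to detect rho via the Fisher transform."""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided critical value
    z_beta = norm.ppf(power)
    # Solve |artanh(rho)| * sqrt(n - 3) = z_alpha + z_beta for n
    n = ((z_alpha + z_beta) / np.arctanh(rho)) ** 2 + 3
    return int(np.ceil(n))

print(n_for_correlation(0.3))  # ~85, matching the figure above
```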

"Using the Fisher transformation... the sampling distribution of the transformed parameter z = artanh(r) is approximately normal."

Limitations: Nonlinearity, Outliers, and Robustness

r detects only linear relations; curves (e.g., a U-shape) yield low r despite strong dependence. Existence requires finite variances, and r is undefined if either variable is constant (σ = 0). Small n amplifies sampling error: n < 30 risks unstable estimates.
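
A small demonstration of the U-shape failure mode:

```python
import numpy as np

# A perfect quadratic dependence with r near zero
x = np.linspace(-1, 1, 101)
y = x ** 2                      # y is fully determined by x
print(np.corrcoef(x, y)[0, 1])  # ~0: symmetric curvature cancels out
```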

r is sensitive to outliers: a single high-leverage point can skew it dramatically. Non-normal data (skewed or heavy-tailed) biases inference, since the t-test assumes bivariate normality.

Robustness hacks: winsorize outliers, use Spearman or Kendall rank correlations for monotonic relationships, or use robust variants like skipped correlations.
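
A sketch comparing these options on data with one planted outlier, using SciPy's spearmanr, kendalltau, and mstats.winsorize (the data are illustrative):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau
from scipy.stats.mstats import winsorize

rng = np.random.default_rng(3)
x = rng.normal(size=50)
y = 0.6 * x + rng.normal(scale=0.5, size=50)
x[0], y[0] = 8.0, -8.0          # plant a single extreme outlier

print(pearsonr(x, y)[0])        # dragged down by the outlier
print(spearmanr(x, y)[0])       # rank-based, more robust
print(kendalltau(x, y)[0])      # rank-based, more robust
xw = np.asarray(winsorize(x, limits=[0.05, 0.05]))  # cap extreme 5% tails
yw = np.asarray(winsorize(y, limits=[0.05, 0.05]))
print(pearsonr(xw, yw)[0])      # Pearson on winsorized data
```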

"As with covariance itself, the measure can only reflect a linear correlation of variables, and ignores many other types of relationships or correlations."

Specialized Variants and Extensions

Weighted r assigns weights to observations (e.g., survey sampling weights). Partial correlation removes a third variable's influence: r_{xy·z} = (r_xy - r_xz r_yz) / √((1 - r_xz²)(1 - r_yz²)). Scaled correlation computes r within short segments of the data to isolate associations at a chosen scale.
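
A sketch of the partial-correlation formula in action, with z as a planted confounder (the helper partial_r is illustrative):

```python
import numpy as np

def partial_r(x, y, z):
    """Partial correlation of x and y controlling for z."""
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

rng = np.random.default_rng(4)
z = rng.normal(size=500)                  # common driver (confounder)
x = z + rng.normal(scale=0.5, size=500)
y = z + rng.normal(scale=0.5, size=500)
print(np.corrcoef(x, y)[0, 1])            # large: induced by z
print(partial_r(x, y, z))                 # near 0 once z is controlled
```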

Multivariate decorrelation: for n variables, diagonalizing the covariance matrix (e.g., via PCA) and rescaling whitens the data so the correlation matrix becomes the identity. A quantum analogue exists for correlations in entangled states.
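
A minimal PCA-whitening sketch (the mixing matrix is illustrative): diagonalize the sample covariance, rescale by the eigenvalues, and the whitened correlations become the identity.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 3)) @ np.array([[1.0, 0.5, 0.2],
                                           [0.0, 1.0, 0.4],
                                           [0.0, 0.0, 1.0]])
Xc = X - X.mean(axis=0)                   # center each variable
cov = np.cov(Xc, rowvar=False)
vals, vecs = np.linalg.eigh(cov)          # diagonalize the covariance
W = vecs / np.sqrt(vals)                  # PCA whitening matrix
Xw = Xc @ W
print(np.round(np.corrcoef(Xw, rowvar=False), 6))  # identity matrix
```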

In simple linear regression, r² is the coefficient of determination: the fraction of variance explained. With multiple predictors, the analogous quantity is the multiple R², which is no longer the square of a single pairwise r.

Software: R's cor(), Python's pandas.DataFrame.corr(method='pearson'), MATLAB's corrcoef().

Key Takeaways

  • Compute r on mean-centered data as a vector cosine; always pair it with a scatterplot to confirm linearity.
  • Interpret |r|: <0.3 weak, 0.3-0.5 moderate, >0.5 strong, but validate against domain knowledge.
  • Test significance with t = r √((n-2)/(1-r²)); prefer bootstrap/permutation for non-normal data.
  • Avoid r for causation, nonlinearity, or tiny samples (n<30); switch to Spearman for ranks.
  • Preprocess: Remove exact duplicates, handle missing via pairwise deletion, cap outliers at 3σ.
  • Scale insight: r invariant to units/shifts, ideal for comparing associations (e.g., height-weight vs. temp-sales).
  • In ML pipelines, use r for feature selection: drop one of each pair with |r| > 0.8 to reduce multicollinearity (see the sketch after this list).
  • Fisher transform for meta-analysis: Average z = artanh(r), back-transform for pooled ρ.
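
One common sketch of the collinearity-pruning step referenced above (drop_collinear and the 0.8 threshold are illustrative):

```python
import numpy as np
import pandas as pd

def drop_collinear(df, threshold=0.8):
    """Drop one feature from each pair with |r| above the threshold."""
    corr = df.corr(method='pearson').abs()
    # keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop)

rng = np.random.default_rng(6)
a = rng.normal(size=200)
df = pd.DataFrame({'a': a,
                   'b': a + rng.normal(scale=0.1, size=200),
                   'c': rng.normal(size=200)})
print(drop_collinear(df).columns.tolist())  # 'b' dropped: |r(a, b)| > 0.8
```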
