ETL Pipeline Turns Messy HR Data into Star Schema Insights
Build a scalable ETL pipeline to restructure flat HR data into a star schema fact/dimension tables, enabling analysis of manager performance, diversity (60% White, 56.6% female), recruitment channels, and 71% accurate attrition prediction where tenure drives 47% of decisions.
Restructure Flat Data into Star Schema for Efficient Analysis
Raw HR datasets arrive as wide, redundant tables that slow queries and complicate scaling. Transform them into a star schema: one central fact table for employee records (EmpID, Age, tenure_years, is_attrition, foreign keys like department_id) surrounded by dimension tables (department, position, salary with qcut-segmented levels: Low/Medium/High for equal distribution groups). This reduces redundancy, speeds queries, and adds business meaning—e.g., salary_level enables quick counts of high-salary employees. Use pd.read_csv for extraction, then merge unique values back with surrogate keys (index + 1) to link facts to dimensions, creating maintainable analytical workloads over monolithic tables.
Clean and Engineer Features Robustly from Unreliable Raw Data
Don't trust provided fields—derive them. Strip column whitespace to prevent code breaks. Convert strings to datetime with errors='coerce' for DateofHire, DateofTermination, DOB (format='%m/%d/%y'). Compute Age as (today - DOB).days // 365, tenure_years as (today - DateofHire).days / 365, is_attrition as DateofTermination.notna(), is_active as opposite. Fill missing Salary and Age with medians (outlier-resistant over means). These steps turn inconsistent inputs into reliable features for downstream analysis and ML, emphasizing derivation over assumption.
Extract Actionable HR Insights Post-Transformation
Query structured data reveals: Managers show no strong performance impact—most employees rate 'Fully Meets' across leaders, with minor 'Exceeds' variations (e.g., Ketsia Liebig, Brandon Miller) and rare 'PIP/Needs Improvement'. Diversity: 60% White, 26% Black/African American, 9% Asian; gender balanced at 56.6% female vs. 43.4% male. Recruitment: Diversity Job Fair yields 100% Black hires; Indeed/LinkedIn balanced; Google Search varied but White-dominant; avoid Online Web Application/Other (100% White). Stacked crosstabs and countplots highlight channels driving diversity, prioritizing targeted sources over uniform ones.
Predict Attrition at 71% Accuracy with Key Drivers Identified
Leverage cleaned fact table merges (absences, salary dims) for RandomForestClassifier on age, tenure_years, absences, Salary (filled medians). Train/test split (80/20) yields 71% accuracy, 59% precision/recall for attrition (confusion: 32 true stay, 13 true leave, 9 misses each). Feature importances: tenure (47%), Salary (23%), absences moderate, age lowest—focus retention on long-tenured, low-salary employees with absences to cut churn.