Automate Weekly PDF Reports with Python ETL Pipeline

Load and merge e-commerce datasets, compute revenue/profit/AOV/growth metrics, generate a PDF with matplotlib charts, ReportLab layout, and rule-based insights, email it via smtplib, and schedule weekly runs via a GitHub Actions cron.

Merge Raw Datasets into Actionable Business Data

Start by loading six Olist e-commerce CSVs (orders, customers, items, payments, products, reviews) with pandas.read_csv, then merge on keys like customer_id, order_id, product_id:

import pandas as pd

def load_data():
    return {
        "orders": pd.read_csv("data/olist_orders_dataset.csv"),
        # ... other datasets
    }

data = load_data()
df = (
    data["orders"]
    .merge(data["customers"], on="customer_id", how="left")
    .merge(data["items"], on="order_id", how="left")
    # ... other merges
)

Convert timestamps to datetime for time-based calcs: df["order_purchase_timestamp"] = pd.to_datetime(...). Compute delivery delays as (delivered - estimated).dt.days > 0 for an is_delayed flag. Derive revenue = price + freight_value and profit = price - freight_value. Aggregate metrics like revenue_current = df["revenue"].sum(), orders_current = df["order_id"].nunique(), AOV = revenue / orders.
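The derivations above can be sketched end to end. This is a minimal, self-contained example on toy rows; the column names follow the Olist schema, but the values are made up for illustration:

```python
import pandas as pd

# Toy stand-in for the merged Olist frame (columns from the real CSVs, fake values).
df = pd.DataFrame({
    "order_id": ["a", "a", "b", "c"],
    "price": [100.0, 50.0, 80.0, 120.0],
    "freight_value": [10.0, 5.0, 8.0, 12.0],
    "order_purchase_timestamp": ["2018-01-05", "2018-01-05", "2018-02-10", "2018-03-01"],
    "order_delivered_customer_date": ["2018-01-20", "2018-01-20", "2018-02-15", "2018-03-20"],
    "order_estimated_delivery_date": ["2018-01-25", "2018-01-25", "2018-02-12", "2018-03-25"],
})

# Timestamps -> datetime for time-based calculations.
for col in ["order_purchase_timestamp", "order_delivered_customer_date",
            "order_estimated_delivery_date"]:
    df[col] = pd.to_datetime(df[col])

# Delay flag: delivered after the estimated date.
df["is_delayed"] = (df["order_delivered_customer_date"]
                    - df["order_estimated_delivery_date"]).dt.days > 0

# Per-line revenue/profit, then aggregate KPIs.
df["revenue"] = df["price"] + df["freight_value"]
df["profit"] = df["price"] - df["freight_value"]

revenue_current = df["revenue"].sum()
orders_current = df["order_id"].nunique()
aov = revenue_current / orders_current
```

Note that order "a" spans two item rows, which is why orders are counted with nunique rather than len.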

Group by month for trends: monthly = df.groupby("month").agg({"revenue": "sum", "order_id": "nunique"}); monthly["growth"] = monthly["revenue"].pct_change() * 100; monthly["moving_avg"] = monthly["revenue"].rolling(3).mean().
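A runnable sketch of that aggregation, assuming a "month" column has already been derived from the purchase timestamp (values here are toy data):

```python
import pandas as pd

# Toy frame; in the pipeline, "month" comes from order_purchase_timestamp.
df = pd.DataFrame({
    "month": ["2018-01", "2018-01", "2018-02", "2018-03"],
    "revenue": [100.0, 50.0, 300.0, 150.0],
    "order_id": ["a", "b", "c", "d"],
})

monthly = df.groupby("month").agg({"revenue": "sum", "order_id": "nunique"})
monthly["growth"] = monthly["revenue"].pct_change() * 100     # month-over-month %
monthly["moving_avg"] = monthly["revenue"].rolling(3).mean()  # 3-month smoothing
```

The first growth value and the first two moving-average values are NaN by construction, which downstream insight rules should tolerate (e.g., via .std() ignoring NaN).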

Simulate weekly reporting with a cutoff: df_sim = df[df["order_purchase_timestamp"] <= cutoff_date], advancing cutoff_date = start_date + pd.Timedelta(days=7 * run_count) via state.txt to mimic live cycles without reprocessing all history.
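A minimal sketch of that state mechanism. The helper names and the anchor date are assumptions for illustration; only the state.txt file and the 7-day advance come from the article:

```python
import os
import pandas as pd

STATE_FILE = "state.txt"  # persists run_count between scheduled runs

def read_run_count():
    # First run: no state file yet.
    if not os.path.exists(STATE_FILE):
        return 0
    with open(STATE_FILE) as f:
        return int(f.read().strip())

def advance_state(run_count):
    with open(STATE_FILE, "w") as f:
        f.write(str(run_count + 1))

start_date = pd.Timestamp("2017-01-02")  # hypothetical anchor date
run_count = read_run_count()
cutoff_date = start_date + pd.Timedelta(days=7 * run_count)
# df_sim = df[df["order_purchase_timestamp"] <= cutoff_date]
advance_state(run_count)  # next run sees one more week of data
```

Each scheduled run therefore reads the counter, filters to its cutoff, and bumps the counter for the next cycle.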

This standardization ensures consistent metric definitions across runs, turning scattered CSVs into a unified view of who bought what, payment amounts, delivery times, and satisfaction.

Add Rule-Based Insights and Build PDF Reports

Metrics alone fail without context—use simple if-conditions to interpret:

def generate_insights(metrics):
    insights = []
    if metrics["profit_current"] < metrics["revenue_current"]:
        insights.append("Revenue growing but profit margin thin, high logistics costs.")
    growth_volatility = metrics["monthly"]["growth"].std()
    if growth_volatility > 50:
        insights.append("Revenue growth highly volatile, unstable performance.")
    # ... more rules
    return insights

Generate the PDF with ReportLab. Sections: executive summary (e.g., 2018 revenue below 2017, orders down, AOV stable, 9.36% delay rate, 3.91 avg review score); KPI trends (Jan 2018 revenue/profit up >600% over 2017 but slowing; AOV 2-14% lower, driven by transaction volume); top products (relogios_presentes and beleza_saude, ~510K revenue each); delivery (SE state 33% delays, casa_conforto_2 60%; overall -10.76 avg delay days, i.e., early deliveries); payments (credit card 75%, boleto 19.1%); reviews (5-stars dominant, avg 3.91).

Key patterns: thin margins from costs; volatile growth; new-customer reliance; delays hurt scores; SP top region; credit users spend more.

Draw charts with matplotlib (plt.savefig("revenue_chart.png")), insert them via Image(width=450, height=220), and build tables via Table(table_data). Central pipeline: data → transform → metrics → insights → generate_report().
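A sketch of the chart step, using toy data in place of the computed monthly metrics. The Agg backend is an assumption for headless CI; the filename matches the one the article embeds via ReportLab's Image:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for CI runs
import matplotlib.pyplot as plt

# Toy monthly revenue series standing in for the computed metrics.
months = ["2018-01", "2018-02", "2018-03"]
revenue = [150.0, 300.0, 150.0]

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(months, revenue, marker="o")
ax.set_title("Monthly Revenue")
ax.set_ylabel("Revenue")
fig.tight_layout()
fig.savefig("revenue_chart.png")  # later embedded via Image(...) in the PDF
plt.close(fig)
```

Saving to PNG first, then embedding, keeps the charting and layout stages decoupled.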

Schedule Email Delivery with GitHub Actions

Automate email with smtplib.SMTP_SSL('smtp.gmail.com', 465): log in using os.getenv("EMAIL_SENDER") and os.getenv("EMAIL_PASSWORD"), attach the PDF, and set a dynamic subject. Store credentials in GitHub Secrets (EMAIL_SENDER, EMAIL_PASSWORD, EMAIL_RECEIVER).
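A minimal sketch of that step, assuming the stdlib email.message API for building the message; the function names, fallback address, and subject format are illustrative, and the actual send is left commented since it needs real credentials:

```python
import os
import smtplib
from email.message import EmailMessage
from datetime import date

def build_report_email(pdf_bytes, receiver):
    msg = EmailMessage()
    msg["From"] = os.getenv("EMAIL_SENDER", "sender@example.com")
    msg["To"] = receiver
    msg["Subject"] = f"Weekly Report - {date.today():%Y-%m-%d}"  # dynamic subject
    msg.set_content("Weekly e-commerce report attached.")
    msg.add_attachment(pdf_bytes, maintype="application",
                       subtype="pdf", filename="report.pdf")
    return msg

def send_report(msg):
    # Requires a Gmail app password in EMAIL_PASSWORD.
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
        server.login(os.getenv("EMAIL_SENDER"), os.getenv("EMAIL_PASSWORD"))
        server.send_message(msg)

msg = build_report_email(b"%PDF-1.4 fake bytes", "receiver@example.com")
# send_report(msg)  # skipped here: needs real credentials
```

Separating message construction from sending also makes the email easy to test without touching the network.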

Deploy via .github/workflows/auto-report.yml:

on:
  schedule:
    - cron: '0 1 * * 1'  # Mondays 1AM UTC
jobs:
  # setup env, pip install, run main.py
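A fuller sketch of that workflow file. The job name, action versions, Python version, and the manual workflow_dispatch trigger are assumptions, not from the article; the cron line and the three secrets are:

```yaml
on:
  schedule:
    - cron: '0 1 * * 1'  # Mondays 1AM UTC
  workflow_dispatch:      # optional: manual trigger for testing
jobs:
  report:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -r requirements.txt
      - run: python main.py
        env:
          EMAIL_SENDER: ${{ secrets.EMAIL_SENDER }}
          EMAIL_PASSWORD: ${{ secrets.EMAIL_PASSWORD }}
          EMAIL_RECEIVER: ${{ secrets.EMAIL_RECEIVER }}
```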

On each trigger the workflow installs dependencies, executes the pipeline (advancing run_count), and generates and sends the report. No local runs needed: you wake to a delivered email. Full loop: cron → ETL → PDF → email → state update for the next cutoff.

Trade-offs: relies on the GitHub free tier (2k min/month); Gmail requires an app password; the rule-based insights are basic (extend with ML if needed). Scales to live data sources by swapping CSVs for APIs/DBs.

