Automate Weekly PDF Reports with Python ETL Pipeline
Load and merge e-commerce datasets, compute revenue, profit, AOV, and growth metrics, generate a PDF with matplotlib/ReportLab charts and rule-based insights, email it via smtplib, and schedule weekly runs with a GitHub Actions cron.
Merge Raw Datasets into Actionable Business Data
Start by loading six Olist e-commerce CSVs (orders, customers, items, payments, products, reviews) with pandas.read_csv, then merge on keys like customer_id, order_id, product_id:
import pandas as pd

def load_data():
    return {
        "orders": pd.read_csv("data/olist_orders_dataset.csv"),
        # ... other datasets
    }
data = load_data()
df = (
    data["orders"]
    .merge(data["customers"], on="customer_id", how="left")
    .merge(data["items"], on="order_id", how="left")
    # ... other merges
)
Convert timestamps to datetime for time-based calculations: df["order_purchase_timestamp"] = pd.to_datetime(...). Compute delivery delays as (delivered - estimated).dt.days > 0 for an is_delayed flag. Derive revenue = price + freight_value and profit = price - freight_value. Aggregate headline metrics such as revenue_current = df["revenue"].sum(), orders_current = df["order_id"].nunique(), and AOV = revenue / orders.
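A minimal sketch of that step, assuming the Olist delivery column names order_delivered_customer_date and order_estimated_delivery_date (names taken from the public dataset, not from the original pipeline):

# Column names below are assumed from the Olist schema, not from the original code.
for col in ["order_purchase_timestamp",
            "order_delivered_customer_date",
            "order_estimated_delivery_date"]:
    df[col] = pd.to_datetime(df[col])

# Delayed if the actual delivery landed after the estimate.
df["is_delayed"] = (
    df["order_delivered_customer_date"] - df["order_estimated_delivery_date"]
).dt.days > 0

# Revenue includes freight; profit strips it out, per the definitions above.
df["revenue"] = df["price"] + df["freight_value"]
df["profit"] = df["price"] - df["freight_value"]

metrics = {
    "revenue_current": df["revenue"].sum(),
    "profit_current": df["profit"].sum(),
    "orders_current": df["order_id"].nunique(),
}
metrics["aov_current"] = metrics["revenue_current"] / metrics["orders_current"]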
Group by month for trends: monthly = df.groupby("month").agg({"revenue": "sum", "order_id": "nunique"}); monthly["growth"] = monthly["revenue"].pct_change() * 100; monthly["moving_avg"] = monthly["revenue"].rolling(3).mean().
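The "month" key is not derived above; one way to build it, assuming it comes from the purchase timestamp:

# Assumed derivation of the "month" key used in the groupby.
df["month"] = df["order_purchase_timestamp"].dt.to_period("M")

monthly = df.groupby("month").agg({"revenue": "sum", "order_id": "nunique"})
monthly["growth"] = monthly["revenue"].pct_change() * 100      # month-over-month %
monthly["moving_avg"] = monthly["revenue"].rolling(3).mean()   # 3-month smoothing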
Simulate weekly reporting with a cutoff: df_sim = df[df["order_purchase_timestamp"] <= cutoff_date], advancing cutoff_date = start_date + pd.Timedelta(days=7 * run_count) via a run counter persisted in state.txt, to mimic live cycles without reprocessing all history at once.
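A sketch of that state handling, assuming state.txt holds a single integer run count (the file format is an assumption):

from pathlib import Path

STATE_FILE = Path("state.txt")  # assumed: stores a single integer run count

def read_run_count():
    return int(STATE_FILE.read_text()) if STATE_FILE.exists() else 0

def advance_run_count(run_count):
    STATE_FILE.write_text(str(run_count + 1))

# start_date: first reporting date, defined elsewhere in the pipeline.
run_count = read_run_count()
cutoff_date = start_date + pd.Timedelta(days=7 * run_count)
df_sim = df[df["order_purchase_timestamp"] <= cutoff_date]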
This standardization ensures consistent metric definitions across runs, turning scattered CSVs into a unified view of who bought what, payment amounts, delivery times, and satisfaction.
Add Rule-Based Insights and Build PDF Reports
Metrics alone say little without context; simple if-conditions turn them into plain-language findings:
def generate_insights(metrics):
    insights = []
    if metrics["profit_current"] < metrics["revenue_current"]:
        insights.append("Revenue growing but profit margin thin, high logistics costs.")
    growth_volatility = metrics["monthly"]["growth"].std()
    if growth_volatility > 50:
        insights.append("Revenue growth highly volatile, unstable performance.")
    # ... further rules
    return insights
Generate the PDF with ReportLab, section by section:
- Executive summary: e.g., 2018 revenue below 2017, orders down, AOV stable, 9.36% delay rate, 3.91 average review score.
- KPI trends: January 2018 revenue and profit more than 600% above 2017 but slowing; AOV 2-14% lower, driven by transaction volume.
- Top products: relogios_presentes and beleza_saude at roughly 510K revenue each.
- Delivery: SE state 33% delays, casa_conforto_2 60%; overall -10.76 average delay days, i.e. mostly early deliveries.
- Payments: credit card 75%, boleto 19.1%.
- Reviews: 5-star ratings dominant, 3.91 average.
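A minimal skeleton of the report builder under those headings, using illustrative generate_report() and report.pdf names:

from reportlab.lib.pagesizes import A4
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer

def generate_report(metrics, insights, filename="report.pdf"):
    styles = getSampleStyleSheet()
    story = [Paragraph("Weekly Business Report", styles["Title"]), Spacer(1, 12)]

    # Executive summary built from the computed metrics.
    story.append(Paragraph(
        f"Revenue: {metrics['revenue_current']:,.0f} | "
        f"Orders: {metrics['orders_current']:,}", styles["Normal"]))

    # One paragraph per rule-based insight.
    for insight in insights:
        story.append(Paragraph(insight, styles["Normal"]))

    SimpleDocTemplate(filename, pagesize=A4).build(story)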
Key patterns: margins are thin due to logistics costs; growth is volatile; the business relies on new customers; delays hurt review scores; SP is the top region; credit-card users spend more.
Chart with matplotlib and save each figure (plt.savefig("revenue_chart.png")), embed it via Image("revenue_chart.png", width=450, height=220), and add tables via Table(table_data). The central pipeline: data → transform → metrics → insights → generate_report().
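A sketch of the chart and table step, reusing the monthly frame and the story list from the skeleton above (the Agg backend keeps matplotlib headless for CI runs):

import matplotlib
matplotlib.use("Agg")  # headless backend for CI
import matplotlib.pyplot as plt
from reportlab.platypus import Image, Table

# Render the revenue trend to a PNG that ReportLab can embed.
fig, ax = plt.subplots(figsize=(8, 4))
monthly["revenue"].plot(ax=ax, title="Monthly Revenue")
fig.savefig("revenue_chart.png", bbox_inches="tight")
plt.close(fig)

story.append(Image("revenue_chart.png", width=450, height=220))

# Simple KPI table: header row plus one row per month.
table_data = [["Month", "Revenue", "Orders"]] + [
    [str(m), f"{row['revenue']:,.0f}", int(row["order_id"])]
    for m, row in monthly.iterrows()
]
story.append(Table(table_data))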
Schedule Email Delivery with GitHub Actions
Automate email delivery: connect with smtplib.SMTP_SSL("smtp.gmail.com", 465), log in with credentials read from os.getenv("EMAIL_SENDER") and os.getenv("EMAIL_PASSWORD"), attach the PDF, and set a dynamic subject line. Keep credentials in GitHub Secrets (EMAIL_SENDER, EMAIL_PASSWORD, EMAIL_RECEIVER).
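A sketch of the sender, assuming the environment variable names above and an illustrative send_report() helper:

import os
import smtplib
from email.message import EmailMessage

def send_report(pdf_path, subject):
    msg = EmailMessage()
    msg["From"] = os.getenv("EMAIL_SENDER")
    msg["To"] = os.getenv("EMAIL_RECEIVER")
    msg["Subject"] = subject
    msg.set_content("Weekly report attached.")

    # Attach the generated PDF.
    with open(pdf_path, "rb") as f:
        msg.add_attachment(f.read(), maintype="application",
                           subtype="pdf", filename=os.path.basename(pdf_path))

    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
        server.login(os.getenv("EMAIL_SENDER"), os.getenv("EMAIL_PASSWORD"))
        server.send_message(msg)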
Deploy via .github/workflows/auto-report.yml:
on:
  schedule:
    - cron: '0 1 * * 1'  # Mondays 1AM UTC
jobs:
  # setup env, pip install, run main.py
Each trigger installs dependencies, executes the pipeline (advancing run_count), and generates and sends the report. No local runs are needed; the email is already delivered when you wake up. The full loop: cron → ETL → PDF → email → state update for the next cutoff.
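A sketch of the main.py entry point that cron invokes, tying the pieces together; the helper names mirror the sketches in this article rather than the original repository:

def main():
    run_count = read_run_count()
    data = load_data()
    # transform/apply_cutoff/compute_metrics are illustrative wrappers
    # around the merge, cutoff, and aggregation steps shown earlier.
    df = transform(data)
    df_sim = apply_cutoff(df, run_count)
    metrics = compute_metrics(df_sim)
    insights = generate_insights(metrics)
    generate_report(metrics, insights, filename="report.pdf")
    send_report("report.pdf", subject=f"Weekly Report #{run_count + 1}")
    advance_run_count(run_count)  # move the cutoff forward for next week

if __name__ == "__main__":
    main()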
Trade-offs: relies on the GitHub Actions free tier (2,000 minutes/month); Gmail requires an app password; the rule-based insights are basic (extend with ML if needed). The pipeline scales to live data sources by swapping the CSVs for APIs or databases.