Google ADK Multi-Agent Data Analysis Pipeline
Build an end-to-end data analysis system in Python using Google ADK: load data, run stats tests, generate viz, and coordinate via a master agent—all with shared state and serializable outputs.
Centralized DataStore for Agent Collaboration
The foundation of this pipeline is a singleton DataStore class that persists datasets, metadata, and analysis history across agents. Instantiate it once:
```python
class DataStore:
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.datasets = {}
            cls._instance.analysis_history = []
        return cls._instance
```
Key methods:
- `add_dataset(name, df, source)`: stores a DataFrame along with its shape, columns, and a timestamp.
- `get_dataset(name)`: retrieves a DataFrame.
- `list_datasets()`: returns the available names.
- `log_analysis(type, dataset, summary)`: tracks the workflow.
Instantiate `DATA_STORE = DataStore()` once at module level. Agents then share state without passing DataFrames through tool calls, which avoids serialization issues. Trade-off: storage is in-memory only, which is fine for interactive sessions; swap in Redis or a database for production.
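A minimal sketch of the store with its key methods filled in, assuming the singleton skeleton above (the method bodies and metadata fields are illustrative, not the article's exact implementation):

```python
from datetime import datetime

import pandas as pd


class DataStore:
    """Singleton store shared by all agents in the pipeline."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.datasets = {}
            cls._instance.analysis_history = []
        return cls._instance

    def add_dataset(self, name, df, source="unknown"):
        # Store the frame plus lightweight metadata for agent previews
        self.datasets[name] = {
            "df": df,
            "shape": df.shape,
            "columns": list(df.columns),
            "source": source,
            "loaded_at": datetime.now().isoformat(),
        }
        return {"name": name, "rows": df.shape[0], "columns": list(df.columns)}

    def get_dataset(self, name):
        entry = self.datasets.get(name)
        return entry["df"] if entry else None

    def list_datasets(self):
        return list(self.datasets.keys())

    def log_analysis(self, analysis_type, dataset, summary):
        self.analysis_history.append(
            {"type": analysis_type, "dataset": dataset, "summary": summary}
        )


DATA_STORE = DataStore()
```

Because `__new__` always returns the same instance, any module that calls `DataStore()` sees the same datasets and history.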
Serialization helper make_serializable(obj) converts NumPy/pandas types to JSON-safe primitives—essential for LLM tool responses.
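One way such a helper could be written (the exact branches and NaN handling are assumptions; the article only names the function):

```python
import json

import numpy as np
import pandas as pd


def make_serializable(obj):
    """Recursively convert NumPy/pandas values into JSON-safe primitives."""
    if isinstance(obj, dict):
        return {str(k): make_serializable(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [make_serializable(v) for v in obj]
    if isinstance(obj, np.ndarray):
        return [make_serializable(v) for v in obj.tolist()]
    if isinstance(obj, pd.Series):
        return make_serializable(obj.to_dict())
    if isinstance(obj, pd.Timestamp):
        return obj.isoformat()
    if isinstance(obj, (np.bool_, bool)):
        return bool(obj)
    if isinstance(obj, (np.integer, int)):
        return int(obj)
    if isinstance(obj, (np.floating, float)):
        f = float(obj)
        return None if np.isnan(f) else f  # NaN is not valid JSON
    return obj
```

Running every tool's return value through this before handing it to the LLM prevents `TypeError: Object of type int64 is not JSON serializable` failures mid-conversation.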
Data Ingestion: Load and Generate Realistic Samples
Agents need quick access to data. Define tools that update ToolContext state with loaded_datasets list and active_dataset.
CSV Loader:
```python
def load_csv(file_path: str, dataset_name: str, tool_context: ToolContext) -> dict:
    df = pd.read_csv(file_path)
    result = DATA_STORE.add_dataset(dataset_name, df, source=file_path)
    # Update shared context so later tools know which dataset is active
    tool_context.state["active_dataset"] = dataset_name
    return result
```

Returns the shape, dtypes, and a `head(3)` preview.
Sample Generators (seed=42 for reproducibility):
- `sales`: 500 rows: order_id, date, product, revenue, profit.
- `customers`: 300 rows: age, income, churn_risk, lifetime_value.
- `timeseries`: daily 2022-2024: trend + seasonal + noise.
- `survey`: 200 rows: Likert scores, response_time.
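A sketch of the sales generator under the stated seed (product names, the date range, and the margin model are assumptions; only the columns and row count come from the list above):

```python
import numpy as np
import pandas as pd


def create_sample_sales(n_rows: int = 500, seed: int = 42) -> pd.DataFrame:
    """Generate a seeded sales table matching the columns listed above."""
    rng = np.random.default_rng(seed)
    products = ["Widget", "Gadget", "Gizmo", "Doohickey"]
    revenue = rng.lognormal(mean=5.0, sigma=0.6, size=n_rows).round(2)
    return pd.DataFrame({
        "order_id": np.arange(1, n_rows + 1),
        "date": pd.to_datetime("2023-01-01")
        + pd.to_timedelta(rng.integers(0, 365, n_rows), unit="D"),
        "product": rng.choice(products, size=n_rows),
        "revenue": revenue,
        # Profit as a noisy fraction of revenue (assumed margin model)
        "profit": (revenue * rng.uniform(0.1, 0.4, n_rows)).round(2),
    })
```

Seeding the generator means every run produces the same table, so agent transcripts and test assertions stay reproducible.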
Example:

```python
create_sample_dataset("sales", "sales_data", tool_context)
```
`list_available_datasets()` reports rows and columns for each stored dataset.
Pitfall avoidance: always check whether `get_dataset()` returned `None` before operating on the result; use `tool_context.state` to track the active dataset. The samples mimic real data distributions (e.g., lognormal income, exponential membership_years).
Statistical Exploration: Describe, Correlate, Test, Detect Outliers
Turn data into insights with deterministic functions returning serialized dicts.
describe_dataset: Splits numeric/categorical; computes mean/std/quantiles/skew for numerics, top values for categoricals. Logs to history.
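A condensed sketch of that split (the rounding and top-3 cutoff are assumptions; history logging via `DATA_STORE.log_analysis` is omitted here for brevity):

```python
import pandas as pd


def describe_dataset(df: pd.DataFrame) -> dict:
    """Summarize numeric and categorical columns separately."""
    numeric = df.select_dtypes(include="number")
    categorical = df.select_dtypes(exclude="number")
    summary = {"numeric": {}, "categorical": {}}
    for col in numeric.columns:
        s = numeric[col].dropna()
        summary["numeric"][col] = {
            "mean": round(float(s.mean()), 3),
            "std": round(float(s.std()), 3),
            "q25": round(float(s.quantile(0.25)), 3),
            "q75": round(float(s.quantile(0.75)), 3),
            "skew": round(float(s.skew()), 3),
        }
    for col in categorical.columns:
        # Top values only, to keep the tool response compact for the LLM
        summary["categorical"][col] = df[col].value_counts().head(3).to_dict()
    return summary
```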
correlation_analysis (pearson/spearman): Numeric corr matrix + strong pairs (>0.5). Highlights: "Found X pairs with |correlation| > 0.5".
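The strong-pair scan might be implemented like this (the 0.5 threshold and highlight string come from the text; the return shape is an assumption):

```python
import pandas as pd


def correlation_analysis(df: pd.DataFrame, method: str = "pearson") -> dict:
    """Numeric correlation matrix plus pairs with |r| > 0.5."""
    corr = df.select_dtypes(include="number").corr(method=method)
    strong_pairs = []
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, no self-pairs
            r = corr.iloc[i, j]
            if abs(r) > 0.5:
                strong_pairs.append(
                    {"pair": (cols[i], cols[j]), "r": round(float(r), 3)}
                )
    return {
        "matrix": corr.round(3).to_dict(),
        "strong_pairs": strong_pairs,
        "highlight": f"Found {len(strong_pairs)} pairs with |correlation| > 0.5",
    }
```

Iterating only the upper triangle avoids reporting each pair twice and skips the trivial diagonal of 1.0s.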
hypothesis_test:
| Test | Params | Output |
|---|---|---|
| normality | column1 | Shapiro-Wilk p>0.05? |
| ttest | column1, group_column (2 groups) | t-stat, p, means |
| anova | column1, group_column (>2) | F-stat, group stats |
| chi2 | column1, column2 | chi2, dof, independence? |
Sample t-test interpretation: "Significant difference" if p<0.05.
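A sketch of the ttest branch, including the upfront group validation the text recommends (the function name and return keys are assumptions):

```python
import numpy as np
import pandas as pd
from scipy import stats


def ttest_by_group(df: pd.DataFrame, column1: str, group_column: str) -> dict:
    """Two-sample t-test; validates that exactly two groups exist."""
    groups = df.dropna(subset=[column1, group_column]).groupby(group_column)[column1]
    if groups.ngroups != 2:
        # Fail fast with a clear message instead of letting the agent loop
        return {"error": f"ttest needs exactly 2 groups, got {groups.ngroups}"}
    (name_a, a), (name_b, b) = list(groups)
    t_stat, p_value = stats.ttest_ind(a, b)
    return {
        "t_stat": round(float(t_stat), 4),
        "p_value": round(float(p_value), 4),
        "means": {str(name_a): round(float(a.mean()), 3),
                  str(name_b): round(float(b.mean()), 3)},
        "interpretation": "Significant difference" if p_value < 0.05
        else "No significant difference",
    }
```

Returning an `"error"` dict rather than raising keeps the failure inside the tool response, where the LLM can read it and correct its parameters.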
outlier_detection (iqr/zscore): IQR bounds or z>3; % outliers + examples.
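The IQR branch might look like the following (`k=1.5` is the conventional fence multiplier; the return keys are assumptions):

```python
import pandas as pd


def iqr_outliers(s: pd.Series, k: float = 1.5) -> dict:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    s = s.dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (s < lower) | (s > upper)
    return {
        "bounds": (round(float(lower), 3), round(float(upper), 3)),
        "n_outliers": int(mask.sum()),
        "pct_outliers": round(100 * float(mask.mean()), 2),
        # A few concrete examples help the LLM describe the outliers
        "examples": [float(v) for v in s[mask].head(5)],
    }
```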
Quality criteria: subsample large columns before Shapiro-Wilk (keep it under 5,000 values); `dropna` before every test; round floats for readability. Common mistake: forgetting `group_column` in group tests; validate it upfront.
Visualization Factory: 7 Chart Types with Grouping
create_visualization generates and displays charts (plt.show, then close) and returns a success message. It supports color_column for grouping.
Supported types:
- histogram/scatter/bar/line/box/heatmap/pie
Examples:
- Bar: Groupby sum or value_counts, annotated values.
- Heatmap: Corr matrix with color-coded text.
- Box: Per-group or single.
```python
create_visualization("sales_data", "bar", "region", "revenue", "category")
```
distribution_report: 2x2 grid—hist+KDE, box, Q-Q, violin. Tests normality visually.
Pro tip: set the seaborn-v0_8-whitegrid style and husl palette upfront. Always call tight_layout(), and close figures to avoid memory leaks in loops.
Multi-Agent Orchestration Setup
Leverage Google ADK for agents/tools:
- LiteLlm(model="openai/gpt-4o-mini")
- InMemorySessionService
- Runner for execution
Tools wrap the functions above and receive a ToolContext at call time. A master "analyst" agent coordinates specialists (e.g., loader, stats, viz, reporter) via function calling.
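Putting the pieces together, the wiring might look like this. The instruction text and exact tool list are assumptions; the imports follow Google ADK's documented layout, and running it requires `google-adk`, `litellm`, an OpenAI API key, and the tool functions from earlier sections, so treat it as a configuration sketch rather than a runnable script:

```python
from google.adk.agents import Agent
from google.adk.models.lite_llm import LiteLlm
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService

# Tool functions (load_csv, create_sample_dataset, ...) are assumed to be
# defined as in the earlier sections of this article.
analyst_agent = Agent(
    name="analyst",
    model=LiteLlm(model="openai/gpt-4o-mini"),
    instruction=(
        "You coordinate a data-analysis workflow: load data, run statistics, "
        "create visualizations, and summarize findings."
    ),
    tools=[load_csv, create_sample_dataset, describe_dataset,
           correlation_analysis, hypothesis_test, create_visualization],
)

session_service = InMemorySessionService()
runner = Runner(
    agent=analyst_agent,
    app_name="data_analysis_pipeline",
    session_service=session_service,
)
```

The Runner drives each turn: it feeds user messages to the agent, executes any tool calls against the shared DataStore, and persists conversation state in the session service.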
Full workflow: Load → Describe/Corr/Test → Viz → Report. State persists via DataStore/ToolContext.
Prerequisites: Python/pandas/scipy/matplotlib basics; OpenAI API key. Colab-friendly (userdata secrets).
Practice: Generate "sales", test revenue normality by region (ANOVA), viz profit by category, log everything.
- We connect these capabilities through a master analyst agent that coordinates specialists, allowing us to see how a production-style analysis system can handle end-to-end tasks in a structured, scalable way.
- This is great for interactive analysis, but watch memory with large CSVs: paginate or stream in production.
- Agents shine when tools are narrow and single-responsibility; broad tools lead to hallucinated parameters.
Key Takeaways
- Start with a shared singleton DataStore to eliminate data-passing friction between agents.
- Generate seeded sample datasets to test pipelines without real files—mimic distributions like lognormal for income.
- Serialize all tool outputs: Convert np/pandas to native types for reliable LLM parsing.
- Validate inputs rigorously (e.g., 2 groups for t-test) to prevent agent error loops.
- Use color_column grouping in viz for quick multi-facet insights; always annotate bars/pies.
- Log analysis history for audit trails—replay workflows easily.
- Pick gpt-4o-mini for cost/speed in stats/viz tasks; upgrade for complex reasoning.
- Scale by swapping InMemorySession for persistent store; add async for parallelism.
- Test hypotheses with p<0.05 thresholds but interpret contextually; statistical significance is not causation.
- Practice: Build your own tool for custom tests, register to agent, run end-to-end on public CSV.