Google ADK Multi-Agent Data Analysis Pipeline
Build an end-to-end data analysis system in Python using Google ADK: load data, run stats tests, generate viz, and coordinate via a master agent—all with shared state and serializable outputs.
Centralized DataStore for Agent Collaboration
The foundation of this pipeline is a singleton DataStore class that persists datasets, metadata, and analysis history across agents. Instantiate it once:
```python
class DataStore:
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.datasets = {}
            cls._instance.analysis_history = []
        return cls._instance
```
Key methods:
- `add_dataset(name, df, source)`: stores a DataFrame along with its shape, columns, and a timestamp.
- `get_dataset(name)`: retrieves a DataFrame.
- `list_datasets()`: returns the available names.
- `log_analysis(type, dataset, summary)`: tracks the workflow.
Instantiate `DATA_STORE = DataStore()` once at module level. Agents then share state without passing DataFrames through tool calls, which avoids serialization issues. Trade-off: storage is in-memory only, which is fine for interactive sessions; swap in Redis or a database for production.
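A minimal sketch of the store with its key methods filled in, assuming the singleton skeleton above (the method bodies and metadata fields are illustrative, not the article's exact implementation):

```python
from datetime import datetime

import pandas as pd


class DataStore:
    """Singleton store shared by all agents in the pipeline."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.datasets = {}
            cls._instance.analysis_history = []
        return cls._instance

    def add_dataset(self, name, df, source="unknown"):
        # Store the frame plus lightweight metadata for agent previews
        self.datasets[name] = {
            "df": df,
            "shape": df.shape,
            "columns": list(df.columns),
            "source": source,
            "loaded_at": datetime.now().isoformat(),
        }
        return {"name": name, "rows": df.shape[0], "columns": list(df.columns)}

    def get_dataset(self, name):
        entry = self.datasets.get(name)
        return entry["df"] if entry else None

    def list_datasets(self):
        return list(self.datasets.keys())

    def log_analysis(self, analysis_type, dataset, summary):
        self.analysis_history.append(
            {"type": analysis_type, "dataset": dataset, "summary": summary}
        )


DATA_STORE = DataStore()
```

Because `__new__` always returns the same instance, any module that calls `DataStore()` sees the same datasets and history.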
Serialization helper make_serializable(obj) converts NumPy/pandas types to JSON-safe primitives—essential for LLM tool responses.
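One way such a helper could be written (the exact branches and NaN handling are assumptions; the article only names the function):

```python
import json

import numpy as np
import pandas as pd


def make_serializable(obj):
    """Recursively convert NumPy/pandas values into JSON-safe primitives."""
    if isinstance(obj, dict):
        return {str(k): make_serializable(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [make_serializable(v) for v in obj]
    if isinstance(obj, np.ndarray):
        return [make_serializable(v) for v in obj.tolist()]
    if isinstance(obj, pd.Series):
        return make_serializable(obj.to_dict())
    if isinstance(obj, pd.Timestamp):
        return obj.isoformat()
    if isinstance(obj, (np.bool_, bool)):
        return bool(obj)
    if isinstance(obj, (np.integer, int)):
        return int(obj)
    if isinstance(obj, (np.floating, float)):
        f = float(obj)
        return None if np.isnan(f) else f  # NaN is not valid JSON
    return obj
```

Running every tool's return value through this before handing it to the LLM prevents `TypeError: Object of type int64 is not JSON serializable` failures mid-conversation.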
Data Ingestion: Load and Generate Realistic Samples
Agents need quick access to data. Define tools that update ToolContext state with loaded_datasets list and active_dataset.
CSV Loader:
```python
def load_csv(file_path: str, dataset_name: str, tool_context: ToolContext) -> dict:
    df = pd.read_csv(file_path)
    result = DATA_STORE.add_dataset(dataset_name, df, source=file_path)
    # Update shared context so later tools know which dataset is active
    tool_context.state["active_dataset"] = dataset_name
    return result
```

Returns the shape, dtypes, and a `head(3)` preview.
Sample Generators (seed=42 for reproducibility):
- `sales`: 500 rows: order_id, date, product, revenue, profit.
- `customers`: 300 rows: age, income, churn_risk, lifetime_value.
- `timeseries`: daily 2022-2024: trend + seasonal + noise.
- `survey`: 200 rows: Likert scores, response_time.
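A sketch of the sales generator under the stated seed (product names, the date range, and the margin model are assumptions; only the columns and row count come from the list above):

```python
import numpy as np
import pandas as pd


def create_sample_sales(n_rows: int = 500, seed: int = 42) -> pd.DataFrame:
    """Generate a seeded sales table matching the columns listed above."""
    rng = np.random.default_rng(seed)
    products = ["Widget", "Gadget", "Gizmo", "Doohickey"]
    revenue = rng.lognormal(mean=5.0, sigma=0.6, size=n_rows).round(2)
    return pd.DataFrame({
        "order_id": np.arange(1, n_rows + 1),
        "date": pd.to_datetime("2023-01-01")
        + pd.to_timedelta(rng.integers(0, 365, n_rows), unit="D"),
        "product": rng.choice(products, size=n_rows),
        "revenue": revenue,
        # Profit as a noisy fraction of revenue (assumed margin model)
        "profit": (revenue * rng.uniform(0.1, 0.4, n_rows)).round(2),
    })
```

Seeding the generator means every run produces the same table, so agent transcripts and test assertions stay reproducible.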
Example:

```python
create_sample_dataset("sales", "sales_data", tool_context)
```
`list_available_datasets()` reports rows and columns for each stored dataset.
Pitfall avoidance: always check whether `get_dataset()` returned `None` before operating on the result; use `tool_context.state` to track the active dataset. The samples mimic real data distributions (e.g., lognormal income, exponential membership_years).
Statistical Exploration: Describe, Correlate, Test, Detect Outliers
Turn data into insights with deterministic functions returning serialized dicts.
describe_dataset: Splits numeric/categorical; computes mean/std/quantiles/skew for numerics, top values for categoricals. Logs to history.
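A condensed sketch of that split (the rounding and top-3 cutoff are assumptions; history logging via `DATA_STORE.log_analysis` is omitted here for brevity):

```python
import pandas as pd


def describe_dataset(df: pd.DataFrame) -> dict:
    """Summarize numeric and categorical columns separately."""
    numeric = df.select_dtypes(include="number")
    categorical = df.select_dtypes(exclude="number")
    summary = {"numeric": {}, "categorical": {}}
    for col in numeric.columns:
        s = numeric[col].dropna()
        summary["numeric"][col] = {
            "mean": round(float(s.mean()), 3),
            "std": round(float(s.std()), 3),
            "q25": round(float(s.quantile(0.25)), 3),
            "q75": round(float(s.quantile(0.75)), 3),
            "skew": round(float(s.skew()), 3),
        }
    for col in categorical.columns:
        # Top values only, to keep the tool response compact for the LLM
        summary["categorical"][col] = df[col].value_counts().head(3).to_dict()
    return summary
```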
correlation_analysis (pearson/spearman): Numeric corr matrix + strong pairs (>0.5). Highlights: "Found X pairs with |correlation| > 0.5".
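The strong-pair scan might be implemented like this (the 0.5 threshold and highlight string come from the text; the return shape is an assumption):

```python
import pandas as pd


def correlation_analysis(df: pd.DataFrame, method: str = "pearson") -> dict:
    """Numeric correlation matrix plus pairs with |r| > 0.5."""
    corr = df.select_dtypes(include="number").corr(method=method)
    strong_pairs = []
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, no self-pairs
            r = corr.iloc[i, j]
            if abs(r) > 0.5:
                strong_pairs.append(
                    {"pair": (cols[i], cols[j]), "r": round(float(r), 3)}
                )
    return {
        "matrix": corr.round(3).to_dict(),
        "strong_pairs": strong_pairs,
        "highlight": f"Found {len(strong_pairs)} pairs with |correlation| > 0.5",
    }
```

Iterating only the upper triangle avoids reporting each pair twice and skips the trivial diagonal of 1.0s.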
hypothesis_test:
| Test | Params | Output |
|---|---|---|
| normality | column1 | Shapiro-Wilk p>0.05? |
| ttest | column1, group_column (2 groups) | t-stat, p, means |
| anova | column1, group_column (>2) | F-stat, group stats |
| chi2 | column1, column2 | chi2, dof, independence? |
Sample t-test interpretation: "Significant difference" if p<0.05.
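A sketch of the ttest branch, including the upfront group validation the text recommends (the function name and return keys are assumptions):

```python
import numpy as np
import pandas as pd
from scipy import stats


def ttest_by_group(df: pd.DataFrame, column1: str, group_column: str) -> dict:
    """Two-sample t-test; validates that exactly two groups exist."""
    groups = df.dropna(subset=[column1, group_column]).groupby(group_column)[column1]
    if groups.ngroups != 2:
        # Fail fast with a clear message instead of letting the agent loop
        return {"error": f"ttest needs exactly 2 groups, got {groups.ngroups}"}
    (name_a, a), (name_b, b) = list(groups)
    t_stat, p_value = stats.ttest_ind(a, b)
    return {
        "t_stat": round(float(t_stat), 4),
        "p_value": round(float(p_value), 4),
        "means": {str(name_a): round(float(a.mean()), 3),
                  str(name_b): round(float(b.mean()), 3)},
        "interpretation": "Significant difference" if p_value < 0.05
        else "No significant difference",
    }
```

Returning an `"error"` dict rather than raising keeps the failure inside the tool response, where the LLM can read it and correct its parameters.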
outlier_detection (iqr/zscore): IQR bounds or z>3; % outliers + examples.
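The IQR branch might look like the following (`k=1.5` is the conventional fence multiplier; the return keys are assumptions):

```python
import pandas as pd


def iqr_outliers(s: pd.Series, k: float = 1.5) -> dict:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    s = s.dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (s < lower) | (s > upper)
    return {
        "bounds": (round(float(lower), 3), round(float(upper), 3)),
        "n_outliers": int(mask.sum()),
        "pct_outliers": round(100 * float(mask.mean()), 2),
        # A few concrete examples help the LLM describe the outliers
        "examples": [float(v) for v in s[mask].head(5)],
    }
```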
Quality criteria: subsample large columns before Shapiro-Wilk (keep it under 5,000 values); `dropna` before every test; round floats for readability. Common mistake: forgetting `group_column` in group tests; validate it upfront.
Visualization Factory: 7 Chart Types with Grouping
create_visualization generates and displays charts (plt.show, then close) and returns a success message. It supports color_column for grouping.
Supported types:
- histogram/scatter/bar/line/box/heatmap/pie
Examples:
- Bar: Groupby sum or value_counts, annotated values.
- Heatmap: Corr matrix with color-coded text.
- Box: Per-group or single.
```python
create_visualization("sales_data", "bar", "region", "revenue", "category")
```
distribution_report: 2x2 grid—hist+KDE, box, Q-Q, violin. Tests normality visually.
Pro tip: set the seaborn-v0_8-whitegrid style and husl palette upfront. Always call tight_layout(), and close figures to avoid memory leaks in loops.
Multi-Agent Orchestration Setup
Leverage Google ADK for agents/tools:
- LiteLlm(model="openai/gpt-4o-mini")
- InMemorySessionService
- Runner for execution
Tools wrap the functions above and receive a ToolContext at call time. A master "analyst" agent coordinates specialists (e.g., loader, stats, viz, reporter) via function calling.
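Putting the pieces together, the wiring might look like this. The instruction text and exact tool list are assumptions; the imports follow Google ADK's documented layout, and running it requires `google-adk`, `litellm`, an OpenAI API key, and the tool functions from earlier sections, so treat it as a configuration sketch rather than a runnable script:

```python
from google.adk.agents import Agent
from google.adk.models.lite_llm import LiteLlm
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService

# Tool functions (load_csv, create_sample_dataset, ...) are assumed to be
# defined as in the earlier sections of this article.
analyst_agent = Agent(
    name="analyst",
    model=LiteLlm(model="openai/gpt-4o-mini"),
    instruction=(
        "You coordinate a data-analysis workflow: load data, run statistics, "
        "create visualizations, and summarize findings."
    ),
    tools=[load_csv, create_sample_dataset, describe_dataset,
           correlation_analysis, hypothesis_test, create_visualization],
)

session_service = InMemorySessionService()
runner = Runner(
    agent=analyst_agent,
    app_name="data_analysis_pipeline",
    session_service=session_service,
)
```

The Runner drives each turn: it feeds user messages to the agent, executes any tool calls against the shared DataStore, and persists conversation state in the session service.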
Full workflow: Load → Describe/Corr/Test → Viz → Report. State persists via DataStore/ToolContext.
Prerequisites: Python/pandas/scipy/matplotlib basics; OpenAI API key. Colab-friendly (userdata secrets).
Practice: Generate "sales", test revenue normality by region (ANOVA), viz profit by category, log everything.
- We connect these capabilities through a master analyst agent that coordinates specialists, allowing us to see how a production-style analysis system can handle end-to-end tasks in a structured, scalable way.
- This is great for interactive analysis, but watch memory with large CSVs: paginate or stream in production.
- Agents shine when tools are narrow and single-responsibility; broad tools lead to hallucinated parameters.
Key Takeaways
- Start with a shared singleton DataStore to eliminate data-passing friction between agents.
- Generate seeded sample datasets to test pipelines without real files—mimic distributions like lognormal for income.
- Serialize all tool outputs: Convert np/pandas to native types for reliable LLM parsing.
- Validate inputs rigorously (e.g., 2 groups for t-test) to prevent agent error loops.
- Use color_column grouping in viz for quick multi-facet insights; always annotate bars/pies.
- Log analysis history for audit trails—replay workflows easily.
- Pick gpt-4o-mini for cost/speed in stats/viz tasks; upgrade for complex reasoning.
- Scale by swapping InMemorySession for persistent store; add async for parallelism.
- Test hypotheses with p<0.05 thresholds but interpret contextually; statistical significance is not causation.
- Practice: Build your own tool for custom tests, register to agent, run end-to-end on public CSV.