Parse, Analyze, Visualize Hermes Agent Traces for Fine-Tuning

Extract thoughts and tool calls from the Hermes agent dataset with regex parsers; compute statistics such as average turns per trajectory, tool frequencies, and error rates; visualize usage patterns; and tokenize with assistant-only labels for SFT on Qwen models.

Extracting Thoughts, Tool Calls, and Responses from Traces

Agent conversations in the lambda/hermes-agent-reasoning-traces dataset (Hugging Face, "kimi" config) consist of turns from the "system", "human", "gpt", and "tool" roles. Use regex to parse gpt messages: THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL) captures internal reasoning; TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL) grabs JSON tool calls (with a json.loads fallback for malformed JSON); whatever text remains after stripping both is the final answer. Tool responses parse via TOOL_RESP_RE into JSON or raw text. This separates internal reasoning from actions, enabling per-turn analysis. Testing on samples reveals thoughts such as planning steps, calls such as {"name": "search", "arguments": {...}}, and correct handling of parallel calls (multiple per turn).
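A minimal parser sketch using the two regexes above; the body of TOOL_RESP_RE is an assumption, since the section names that regex but does not show its pattern:

    import json
    import re

    THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
    TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
    # Pattern body assumed: the source names TOOL_RESP_RE but does not show it.
    TOOL_RESP_RE = re.compile(r"<tool_response>\s*(.*?)\s*</tool_response>", re.DOTALL)

    def parse_gpt_turn(text):
        """Split one gpt message into thoughts, tool calls, and the final answer."""
        thoughts = [m.strip() for m in THINK_RE.findall(text)]
        calls = []
        for raw in TOOL_CALL_RE.findall(text):
            try:
                calls.append(json.loads(raw))      # well-formed JSON call
            except json.JSONDecodeError:
                calls.append({"_raw": raw})        # keep malformed call as raw text
        # Whatever remains after stripping both tag blocks is the final answer.
        answer = TOOL_CALL_RE.sub("", THINK_RE.sub("", text)).strip()
        return {"thoughts": thoughts, "tool_calls": calls, "answer": answer}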

Tool schemas from json.loads(ex["tools"]) list the available functions with names and descriptions. Render full traces with render_trace(ex) to display USER, THINK, CALL, TOOL_RESPONSE, and ANSWER lines for inspection, shortening long text.
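A sketch of render_trace, assuming each example stores its trace as a "conversations" list of {"from", "value"} dicts, and reusing parse_gpt_turn (and the json import) from the parser sketch above:

    def shorten(s, limit=300):
        return s if len(s) <= limit else s[:limit] + " ..."

    def render_trace(ex):
        """Pretty-print one trajectory for inspection."""
        for turn in ex["conversations"]:
            role, value = turn["from"], turn["value"]
            if role == "human":
                print("USER:", shorten(value))
            elif role == "tool":
                print("TOOL_RESPONSE:", shorten(value))
            elif role == "gpt":
                parsed = parse_gpt_turn(value)   # parser from the sketch above
                for t in parsed["thoughts"]:
                    print("THINK:", shorten(t))
                for c in parsed["tool_calls"]:
                    print("CALL:", shorten(json.dumps(c)))
                if parsed["answer"]:
                    print("ANSWER:", shorten(parsed["answer"]))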

Quantifying Behaviors: Tool Usage, Lengths, and Errors

Scan 3000 trajectories and aggregate: count tool calls per category/subcategory/task; track turns per trajectory, thoughts per gpt turn, calls per trajectory, and errors (an "error" key in the response JSON, exit_code=1, or a traceback). Compute averages such as turns per trajectory and calls per trajectory, the percentage of trajectories containing errors, and the percentage of parallel turns (width > 1). Find the top tools via a Counter over call names. For length distributions, histogram the character counts of thoughts, json.dumps(tool_calls), and final answers across 500 examples; this reveals typical reasoning/tool/answer sizes for token budgeting.
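A sketch of the aggregation pass, assuming a Hugging Face datasets object and the same "conversations" layout; the error heuristics are the string matches described above:

    from collections import Counter

    def scan_stats(dataset, n=3000):
        tool_names = Counter()
        turns_per_traj, calls_per_traj = [], []
        trajs_with_error = 0
        for ex in dataset.select(range(min(n, len(dataset)))):
            convo = ex["conversations"]
            turns_per_traj.append(len(convo))
            n_calls, has_error = 0, False
            for turn in convo:
                if turn["from"] == "gpt":
                    parsed = parse_gpt_turn(turn["value"])
                    n_calls += len(parsed["tool_calls"])
                    tool_names.update(c.get("name", "?") for c in parsed["tool_calls"])
                elif turn["from"] == "tool":
                    low = turn["value"].lower()
                    # crude error heuristics: error key, failing exit code, traceback
                    if '"error"' in low or "exit_code=1" in low or "traceback" in low:
                        has_error = True
            calls_per_traj.append(n_calls)
            trajs_with_error += has_error
        n_seen = len(turns_per_traj)
        print("avg turns/traj :", sum(turns_per_traj) / n_seen)
        print("avg calls/traj :", sum(calls_per_traj) / n_seen)
        print("% trajs w/ err :", 100 * trajs_with_error / n_seen)
        print("top tools      :", tool_names.most_common(10))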

A TraceReplayer class reconstructs the steps: each gpt turn is paired with the tool responses that follow it, enabling step-by-step playback that prints thoughts, calls with arguments, responses, and the final answer. This surfaces patterns such as an average of 5-10 turns per trajectory (via the histogram), the most frequent tools (e.g., search/browse at the top), and low error rates indicating robust behaviors.
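A sketch of the replayer under the same assumptions, pairing each gpt turn with the tool responses that immediately follow it:

    class TraceReplayer:
        """Step-by-step playback of one trajectory."""
        def __init__(self, ex):
            self.steps, convo, i = [], ex["conversations"], 0
            while i < len(convo):
                if convo[i]["from"] == "gpt":
                    parsed = parse_gpt_turn(convo[i]["value"])  # sketch above
                    responses, j = [], i + 1
                    # collect the tool responses that answer this turn's calls
                    while j < len(convo) and convo[j]["from"] == "tool":
                        responses.append(convo[j]["value"])
                        j += 1
                    self.steps.append((parsed, responses))
                    i = j
                else:
                    i += 1

        def play(self):
            for k, (parsed, responses) in enumerate(self.steps):
                print(f"--- step {k} ---")
                for t in parsed["thoughts"]:
                    print("THINK:", t[:200])
                for c in parsed["tool_calls"]:
                    print("CALL:", json.dumps(c)[:200])
                for r in responses:
                    print("RESPONSE:", r[:200])
                if parsed["answer"]:
                    print("FINAL:", parsed["answer"][:200])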

A four-panel plot summarizes the scan: a horizontal bar chart of the top 15 tools by call volume; a log-scale bar chart of parallel widths (calls per turn); a histogram of conversation lengths (bins=40); and a pie chart of the category distribution. Highlights: most turns issue a single tool call, conversation lengths show a skewed long tail, and a few categories dominate.
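A matplotlib sketch of the four panels; the function and argument names are hypothetical and assume the Counters/lists produced by the scan above:

    from collections import Counter
    import matplotlib.pyplot as plt

    def plot_overview(tool_counts, widths, conv_lengths, category_counts):
        fig, axes = plt.subplots(2, 2, figsize=(12, 8))

        # Panel 1: top 15 tools, horizontal bars
        names, counts = zip(*tool_counts.most_common(15))
        axes[0, 0].barh(list(names)[::-1], list(counts)[::-1])
        axes[0, 0].set_title("Top 15 tools by call volume")

        # Panel 2: parallel width (# calls per gpt turn), log-scale counts
        w = Counter(widths)
        axes[0, 1].bar(list(w.keys()), list(w.values()))
        axes[0, 1].set_yscale("log")
        axes[0, 1].set_title("Parallel width (calls per turn)")

        # Panel 3: conversation lengths
        axes[1, 0].hist(conv_lengths, bins=40)
        axes[1, 0].set_title("Conversation lengths (turns)")

        # Panel 4: category distribution
        axes[1, 1].pie(list(category_counts.values()),
                       labels=list(category_counts.keys()), autopct="%1.0f%%")
        axes[1, 1].set_title("Category distribution")

        plt.tight_layout()
        plt.show()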

For training, convert each trace to OpenAI-style messages: map "gpt"→"assistant" and "tool"→"user". Tokenize with Qwen/Qwen2.5-0.5B-Instruct: apply apply_chat_template per message, encode, and mask the labels of non-assistant tokens with -100. Truncate to 2048/1024 tokens; roughly 30-50% of tokens remain trainable (assistant-only). A TRL SFTTrainer demo maps examples to a text field, loads the model in fp16, and trains on 200 examples (batch=1, gradient accumulation=4, 20 steps, lr=2e-5, sequence length 1024), handling tool messages with a "TOOL\n" prefix. This yields a production-ready format for fine-tuning tool use and reasoning.
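A minimal sketch of the assistant-only label masking, templating each message separately as described; note that real code may need to account for the Qwen template inserting a default system prompt per call:

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

    def tokenize_with_assistant_labels(messages, max_len=2048):
        input_ids, labels = [], []
        for msg in messages:
            # Render each message with the chat template, then encode it.
            text = tok.apply_chat_template([msg], tokenize=False)
            ids = tok(text, add_special_tokens=False)["input_ids"]
            input_ids.extend(ids)
            if msg["role"] == "assistant":
                labels.extend(ids)                 # train on assistant tokens
            else:
                labels.extend([-100] * len(ids))   # mask system/user/tool tokens
        return {"input_ids": input_ids[:max_len], "labels": labels[:max_len]}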

