Master OpenMementos: Parse Traces, Compress Context, Prep SFT Data
Stream Microsoft's OpenMementos dataset, parse block-memento structures with regex, measure ~6x token compression, simulate inference traces, and format for supervised fine-tuning—all in a Colab-ready Python workflow.
Stream Dataset Efficiently Without Full Download
OpenMementos structures long reasoning traces as sequences of detailed blocks, each paired with a concise memento (summary). You'll need the datasets, transformers, matplotlib, and pandas libraries. Load in streaming mode to inspect the schema without gigabytes of storage:
DATASET = "microsoft/OpenMementos"
ds_stream = load_dataset(DATASET, split="train", streaming=True)
first_row = next(iter(ds_stream))
print("Columns :", list(first_row.keys()))
This reveals keys like domain (e.g., math, code), source, problem, response. Responses embed special tokens: <|block_start|>...<|block_end|>, <|summary_start|>...<|summary_end|>, <think>...</think>. Streaming supports analysis on massive datasets (e.g., process 500 samples via itertools.islice). Assumes familiarity with Hugging Face datasets and Python REPL/Colab; no prior OpenMementos knowledge needed.
Common pitfall: Ignoring streaming—full download fails on consumer hardware. Principle: Process lazily to handle 1M+ traces across domains like science, code, math.
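As a quick illustration, here is a minimal sketch tallying domains over the first 500 streamed rows (the Counter usage is my addition; the domain field comes from the schema printed above):

from collections import Counter
from itertools import islice

# only the first 500 rows are fetched; the rest of the dataset stays remote
domains = Counter(row["domain"] for row in islice(ds_stream, 500))
print(domains.most_common())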
Extract Blocks, Mementos, and Compute Compression Ratios
Define a regex parser to dismantle responses:
import re
from typing import Dict

BLOCK_RE = re.compile(r"<\|block_start\|>(.*?)<\|block_end\|>", re.DOTALL)
SUMMARY_RE = re.compile(r"<\|summary_start\|>(.*?)<\|summary_end\|>", re.DOTALL)
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def parse_memento(response: str) -> Dict:
    blocks = [m.strip() for m in BLOCK_RE.findall(response)]
    summaries = [m.strip() for m in SUMMARY_RE.findall(response)]
    think = THINK_RE.search(response)
    # text after </think> is treated as the final answer (assumption about layout)
    final_ans = response.split("</think>", 1)[-1].strip() if "</think>" in response else ""
    return {"blocks": blocks, "summaries": summaries,
            "think": think.group(1).strip() if think else "",
            "final_ans": final_ans}
Validate that blocks and summaries pair 1:1, and skip malformed rows. For N=500 samples, tally characters and words per domain and compute compression ratios (mementos/blocks); a construction sketch for the resulting df appears at the end of this section. Aggregate with pandas:
per_dom = df.groupby("domain").agg({
    "n_blocks": "median",
    "compress_char": "median",  # ~0.15-0.20 typical
}).round(3)
Medians show the code domain at ~12 blocks with ~6x token compression (the paper's benchmark), while math traces run deeper at 4-5x. Visualize distributions with df.plot.scatter(x='block_words', y='summ_words'): the roughly linear scaling shows mementos running ~15-20% of block length.
Quality criteria: good traces have balanced block-memento pairs, and compression above 4x signals effective summarization. Common mistake: naive string splits break on newlines and embedded specials; the re.DOTALL regexes above handle both. This step fits mid-workflow: after loading, before training.
Before: Raw response (10k+ chars). After parsing: Itemized blocks (e.g., Block 1: "Consider the equation...") vs. Memento 1: "Equation simplified to quadratic." Principle: Mementos preserve decisions, discard verbose steps.
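For reference, here is a minimal sketch of how the df aggregated above might be assembled from the stream (the column names mirror the aggregation and scatter plot; the notebook's exact code may differ):

import pandas as pd
from itertools import islice

rows = []
for ex in islice(ds_stream, 500):
    p = parse_memento(ex["response"])
    # enforce 1:1 block-memento pairing; skip malformed traces
    if not p["blocks"] or len(p["blocks"]) != len(p["summaries"]):
        continue
    b_chars = sum(len(b) for b in p["blocks"])
    s_chars = sum(len(s) for s in p["summaries"])
    b_words = sum(len(b.split()) for b in p["blocks"])
    s_words = sum(len(s.split()) for s in p["summaries"])
    rows.append({"domain": ex["domain"], "n_blocks": len(p["blocks"]),
                 "block_words": b_words, "summ_words": s_words,
                 "compress_char": s_chars / max(b_chars, 1),
                 "compress_word": s_words / max(b_words, 1)})
df = pd.DataFrame(rows)
print(f"Analyzed {len(df)} rows. Domain counts:\n{df['domain'].value_counts()}")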
Simulate Inference Compression and Render Traces
Mimic runtime: Replace early blocks with mementos, keep last K=1 full:
def compress_trace(response: str, keep_last_k: int = 1) -> str:
    blocks = BLOCK_RE.findall(response)
    summaries = SUMMARY_RE.findall(response)
    # text after </think> is treated as the final answer (assumption, as above)
    final_ans = response.split("</think>", 1)[-1].strip() if "</think>" in response else ""
    out = ["<think>"]
    for i, (b, s) in enumerate(zip(blocks, summaries)):
        if i >= len(blocks) - keep_last_k:
            # the most recent block(s) stay in full, alongside their mementos
            out.append(f"<|block_start|>{b}<|block_end|>")
            out.append(f"<|summary_start|>{s}<|summary_end|>")
        else:
            # earlier blocks collapse to their mementos only
            out.append(f"<|summary_start|>{s}<|summary_end|>")
    out.append("</think>")
    if final_ans:
        out.append(final_ans)
    return "\n".join(out)
Example: Original 8k chars → Compressed 2k (25%). Token-level (GPT-2 + specials): Blocks 1200 → Mementos 200 (6x).
from transformers import AutoTokenizer

# the special tokens listed in the schema section above
MEM_TOKENS = ["<|block_start|>", "<|block_end|>", "<|summary_start|>",
              "<|summary_end|>", "<think>", "</think>"]
tok = AutoTokenizer.from_pretrained("gpt2")
tok.add_special_tokens({"additional_special_tokens": MEM_TOKENS})
def tlen(s): return len(tok(s, add_special_tokens=False).input_ids)  # counts sans BOS/EOS
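Putting the tokenizer to work on a single trace (a sketch reusing first_row and parse_memento from earlier; it reproduces the trace-level readout quoted in the sample outputs below):

p = parse_memento(first_row["response"])
b_tok = sum(tlen(b) for b in p["blocks"])
s_tok = sum(tlen(s) for s in p["summaries"])
print(f"block tokens = {b_tok}, memento tokens = {s_tok}, "
      f"compression = {b_tok / max(s_tok, 1):.2f}x")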
Render for inspection:
import textwrap

def render_trace(response: str, width: int = 220) -> None:
    p = parse_memento(response)
    for i, (b, s) in enumerate(zip(p["blocks"], p["summaries"]), 1):
        ratio = len(s) / max(len(b), 1) * 100
        print(f"▶ BLOCK {i} ({len(b):,} chars)")
        # shorten() is one plausible completion of the elided print: preview at `width` chars
        print(textwrap.indent(textwrap.shorten(b, width=width), "    "))
        print(f"◀ MEMENTO {i} ({len(s):,} chars · {ratio:.1f}%)")
        print(textwrap.indent(textwrap.shorten(s, width=width), "    "))
Outputs side-by-side: Block verbosity vs. memento brevity. Exercise: Tweak keep_last_k=2; measure KV cache savings.
Pitfall: forgetting to add the special tokens to the tokenizer, which distorts counts. Good output: a compressed trace parses back to roughly 90% of the original information.
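For the keep_last_k exercise, a minimal sketch using token count as a proxy for KV-cache footprint (assumes first_row, tlen, and compress_trace from earlier):

r = first_row["response"]
full = tlen(r)
for k in (1, 2):
    c = tlen(compress_trace(r, keep_last_k=k))
    print(f"keep_last_k={k}: {c:,} tokens ({c / full:.1%} of the full trace)")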
Format for Supervised Fine-Tuning
Convert to the chat-messages format:
def to_chat(ex):
    return {"messages": [
        {"role": "user", "content": ex["problem"]},
        {"role": "assistant", "content": ex["response"]},
    ]}
chat_stream = load_dataset(DATASET, split="train", streaming=True).map(to_chat)
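A quick sanity check on the mapped stream (a sketch; the 120-char truncation is arbitrary):

sample = next(iter(chat_stream))
for msg in sample["messages"]:
    print(msg["role"], "->", msg["content"][:120].replace("\n", " "))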
Stream the full subset when you need extra fields (e.g., sentence alignments). Principle: the SFT-ready format preserves the special tokens for LoRA/PEFT; compression cuts costs 4-6x.
"Trace-level token compression for this example: block tokens = 1200, memento tokens = 200, compression = 6.00× (paper reports ~6×)"
"Analyzed 500 rows. Domain counts: code 180, math 150... Per-domain medians (ratio = mementos / blocks): code 0.167 char ratio"
"Original: 8,452 chars, Compressed: 2,134 chars (25.3% of original)"
Key Takeaways
- Stream OpenMementos with load_dataset(..., streaming=True) to analyze without a full download.
- Use regex BLOCK_RE, SUMMARY_RE to parse blocks/mementos; validate 1:1 pairing.
- Compute compression: sum(len(s.split()) for s in summaries) / sum(len(b.split()) for b in blocks); expect 4-6x on tokens.
- Simulate inference: compress_trace(keep_last_k=1) replaces early blocks with mementos.
- Add special tokens to the tokenizer before tlen() for accurate counts.
- Render traces with textwrap.indent() for manual review of block-memento fidelity.
- Map to {"messages": [...]} chat format for direct SFT pipelines.
- Group by domain in pandas; math/code differ in trace depth, so tailor analysis.
- Practice: process 1k samples, plot compress_word histograms per domain.
- Scale: align streamed data with full-subset fields for richer annotations.