Master OpenMementos: Parse Traces, Compress Context, Prep SFT Data
Stream Microsoft's OpenMementos dataset, parse block-memento structures with regex, measure ~6x token compression, simulate inference traces, and format for supervised fine-tuning—all in a Colab-ready Python workflow.
Stream Dataset Efficiently Without Full Download
OpenMementos structures long reasoning traces as sequences of detailed blocks, each paired with a concise memento (summary). You'll need the datasets, transformers, matplotlib, and pandas libraries. Load in streaming mode to inspect the schema without gigabytes of storage:
DATASET = "microsoft/OpenMementos"
ds_stream = load_dataset(DATASET, split="train", streaming=True)
first_row = next(iter(ds_stream))
print("Columns :", list(first_row.keys()))
This reveals keys like domain (e.g., math, code), source, problem, response. Responses embed special tokens: <|block_start|>...<|block_end|>, <|summary_start|>...<|summary_end|>, <think>...</think>. Streaming supports analysis on massive datasets (e.g., process 500 samples via itertools.islice). Assumes familiarity with Hugging Face datasets and Python REPL/Colab; no prior OpenMementos knowledge needed.
Common pitfall: Ignoring streaming—full download fails on consumer hardware. Principle: Process lazily to handle 1M+ traces across domains like science, code, math.
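As a quick illustration, here is a minimal sketch tallying domains over the first 500 streamed rows (the Counter usage is my addition; the domain field comes from the schema printed above):

from collections import Counter
from itertools import islice

# only the first 500 rows are fetched; the rest of the dataset stays remote
domains = Counter(row["domain"] for row in islice(ds_stream, 500))
print(domains.most_common())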
Extract Blocks, Mementos, and Compute Compression Ratios
Define a regex parser to dismantle responses:
import re
from typing import Dict

BLOCK_RE = re.compile(r"<\|block_start\|>(.*?)<\|block_end\|>", re.DOTALL)
SUMMARY_RE = re.compile(r"<\|summary_start\|>(.*?)<\|summary_end\|>", re.DOTALL)
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def parse_memento(response: str) -> Dict:
    blocks = [m.strip() for m in BLOCK_RE.findall(response)]
    summaries = [m.strip() for m in SUMMARY_RE.findall(response)]
    think = THINK_RE.search(response)
    # text after </think> is treated as the final answer (assumption about layout)
    final_ans = response.split("</think>", 1)[-1].strip() if "</think>" in response else ""
    return {"blocks": blocks, "summaries": summaries,
            "think": think.group(1).strip() if think else "",
            "final_ans": final_ans}
Validate that blocks and summaries pair 1:1, and skip malformed rows. For N=500 samples, tally characters and words per domain and compute compression ratios (mementos/blocks); a construction sketch for the resulting df appears at the end of this section. Aggregate with pandas:
per_dom = df.groupby("domain").agg({
    "n_blocks": "median",
    "compress_char": "median",  # ~0.15-0.20 typical
}).round(3)
Medians show the code domain at ~12 blocks with ~6x token compression (the paper's benchmark), while math traces run deeper at 4-5x. Visualize distributions with df.plot.scatter(x='block_words', y='summ_words'): the roughly linear scaling shows mementos running ~15-20% of block length.
Quality criteria: good traces have balanced block-memento pairs, and compression above 4x signals effective summarization. Common mistake: naive string splits break on newlines and embedded specials; the re.DOTALL regexes above handle both. This step fits mid-workflow: after loading, before training.
Before: Raw response (10k+ chars). After parsing: Itemized blocks (e.g., Block 1: "Consider the equation...") vs. Memento 1: "Equation simplified to quadratic." Principle: Mementos preserve decisions, discard verbose steps.
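For reference, here is a minimal sketch of how the df aggregated above might be assembled from the stream (the column names mirror the aggregation and scatter plot; the notebook's exact code may differ):

import pandas as pd
from itertools import islice

rows = []
for ex in islice(ds_stream, 500):
    p = parse_memento(ex["response"])
    # enforce 1:1 block-memento pairing; skip malformed traces
    if not p["blocks"] or len(p["blocks"]) != len(p["summaries"]):
        continue
    b_chars = sum(len(b) for b in p["blocks"])
    s_chars = sum(len(s) for s in p["summaries"])
    b_words = sum(len(b.split()) for b in p["blocks"])
    s_words = sum(len(s.split()) for s in p["summaries"])
    rows.append({"domain": ex["domain"], "n_blocks": len(p["blocks"]),
                 "block_words": b_words, "summ_words": s_words,
                 "compress_char": s_chars / max(b_chars, 1),
                 "compress_word": s_words / max(b_words, 1)})
df = pd.DataFrame(rows)
print(f"Analyzed {len(df)} rows. Domain counts:\n{df['domain'].value_counts()}")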
Simulate Inference Compression and Render Traces
Mimic runtime: Replace early blocks with mementos, keep last K=1 full:
def compress_trace(response: str, keep_last_k: int = 1) -> str:
    blocks = BLOCK_RE.findall(response)
    summaries = SUMMARY_RE.findall(response)
    # text after </think> is treated as the final answer (assumption, as above)
    final_ans = response.split("</think>", 1)[-1].strip() if "</think>" in response else ""
    out = ["<think>"]
    for i, (b, s) in enumerate(zip(blocks, summaries)):
        if i >= len(blocks) - keep_last_k:
            # the most recent block(s) stay in full, alongside their mementos
            out.append(f"<|block_start|>{b}<|block_end|>")
            out.append(f"<|summary_start|>{s}<|summary_end|>")
        else:
            # earlier blocks collapse to their mementos only
            out.append(f"<|summary_start|>{s}<|summary_end|>")
    out.append("</think>")
    if final_ans:
        out.append(final_ans)
    return "\n".join(out)
Example: Original 8k chars → Compressed 2k (25%). Token-level (GPT-2 + specials): Blocks 1200 → Mementos 200 (6x).
from transformers import AutoTokenizer

# the special tokens listed in the schema section above
MEM_TOKENS = ["<|block_start|>", "<|block_end|>", "<|summary_start|>",
              "<|summary_end|>", "<think>", "</think>"]
tok = AutoTokenizer.from_pretrained("gpt2")
tok.add_special_tokens({"additional_special_tokens": MEM_TOKENS})
def tlen(s): return len(tok(s, add_special_tokens=False).input_ids)  # counts sans BOS/EOS
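Putting the tokenizer to work on a single trace (a sketch reusing first_row and parse_memento from earlier; it reproduces the trace-level readout quoted in the sample outputs below):

p = parse_memento(first_row["response"])
b_tok = sum(tlen(b) for b in p["blocks"])
s_tok = sum(tlen(s) for s in p["summaries"])
print(f"block tokens = {b_tok}, memento tokens = {s_tok}, "
      f"compression = {b_tok / max(s_tok, 1):.2f}x")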
Render for inspection:
import textwrap

def render_trace(response: str, width: int = 220) -> None:
    p = parse_memento(response)
    for i, (b, s) in enumerate(zip(p["blocks"], p["summaries"]), 1):
        ratio = len(s) / max(len(b), 1) * 100
        print(f"▶ BLOCK {i} ({len(b):,} chars)")
        # shorten() is one plausible completion of the elided print: preview at `width` chars
        print(textwrap.indent(textwrap.shorten(b, width=width), "    "))
        print(f"◀ MEMENTO {i} ({len(s):,} chars · {ratio:.1f}%)")
        print(textwrap.indent(textwrap.shorten(s, width=width), "    "))
Outputs side-by-side: Block verbosity vs. memento brevity. Exercise: Tweak keep_last_k=2; measure KV cache savings.
Pitfall: forgetting to add the special tokens to the tokenizer, which distorts counts. Good output: a compressed trace parses back to roughly 90% of the original information.
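For the keep_last_k exercise, a minimal sketch using token count as a proxy for KV-cache footprint (assumes first_row, tlen, and compress_trace from earlier):

r = first_row["response"]
full = tlen(r)
for k in (1, 2):
    c = tlen(compress_trace(r, keep_last_k=k))
    print(f"keep_last_k={k}: {c:,} tokens ({c / full:.1%} of the full trace)")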
Format for Supervised Fine-Tuning
Convert to the chat-messages format:
def to_chat(ex):
    return {"messages": [
        {"role": "user", "content": ex["problem"]},
        {"role": "assistant", "content": ex["response"]},
    ]}
chat_stream = load_dataset(DATASET, split="train", streaming=True).map(to_chat)
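A quick sanity check on the mapped stream (a sketch; the 120-char truncation is arbitrary):

sample = next(iter(chat_stream))
for msg in sample["messages"]:
    print(msg["role"], "->", msg["content"][:120].replace("\n", " "))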
Stream the full subset when you need extra fields (e.g., sentence alignments). Principle: the SFT-ready format preserves the special tokens for LoRA/PEFT; compression cuts costs 4-6x.
"Trace-level token compression for this example: block tokens = 1200, memento tokens = 200, compression = 6.00× (paper reports ~6×)"
"Analyzed 500 rows. Domain counts: code 180, math 150... Per-domain medians (ratio = mementos / blocks): code 0.167 char ratio"
"Original: 8,452 chars, Compressed: 2,134 chars (25.3% of original)"
Key Takeaways
- Stream OpenMementos with load_dataset(..., streaming=True) to analyze without a full download.
- Use regex BLOCK_RE, SUMMARY_RE to parse blocks/mementos; validate 1:1 pairing.
- Compute compression: sum(len(s.split()) for s in summaries) / sum(len(b.split()) for b in blocks); expect 4-6x on tokens.
- Simulate inference: compress_trace(keep_last_k=1) replaces early blocks with mementos.
- Add special tokens to the tokenizer before tlen() for accurate counts.
- Render traces with textwrap.indent() for manual review of block-memento fidelity.
- Map to {"messages": [...]} chat format for direct SFT pipelines.
- Group by domain in pandas; math/code differ in trace depth, so tailor analysis.
- Practice: process 1k samples, plot compress_word histograms per domain.
- Scale: align streamed data with full-subset fields for richer annotations.