Stream Parse TaskTrove Dataset for AI Task Insights

Stream the multi-GB TaskTrove dataset without a full download: parse its gzip-compressed tar/zip/JSON binaries to analyze sources, compressed sizes (median/p50 in KB), and filenames, and detect verifiers for RL-ready tasks via multi-signal heuristics.

Build Streaming Parser for Compressed Task Binaries

Handle TaskTrove's task_binary fields (gzip-compressed blobs up to a p95 of some KB) without downloading the full dataset by using datasets.load_dataset(..., streaming=True). Convert each blob to bytes via to_bytes(), which decodes base64 strings or integer lists. Decompress when the gzip magic header (b'\x1f\x8b') is present, then auto-detect the format in parse_task(): try tarfile.open() first for archives (extracting files as str or bytes), fall back to ZipFile, then json.loads() (or JSONL line by line), then a plain-text decode, and finally raw binary. This yields dicts with format, files (for archives), and content, plus raw_size and compressed_size. Example: the first sample decompresses from its compressed bytes to raw, revealing a tar with JSON metadata and .py code files.
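The decode/decompress/detect pipeline above can be sketched as follows. to_bytes() and parse_task() mirror the helpers named in the text, but every signature and return shape here is an assumption, not the dataset's actual API:

```python
import base64
import gzip
import io
import json
import tarfile
import zipfile

GZIP_MAGIC = b"\x1f\x8b"

def to_bytes(blob):
    """Normalize a task_binary field to bytes (sketch of the helper above)."""
    if isinstance(blob, bytes):
        return blob
    if isinstance(blob, str):
        return base64.b64decode(blob)   # base64-encoded string
    return bytes(blob)                  # list of byte values

def parse_task(blob):
    """Decompress if gzip, then auto-detect tar / zip / JSON / text / binary."""
    raw = to_bytes(blob)
    meta = {"compressed_size": len(raw)}
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    meta["raw_size"] = len(raw)

    # 1) tar archive: extract members, decoding to str where possible
    try:
        with tarfile.open(fileobj=io.BytesIO(raw)) as tf:
            files = {}
            for m in tf.getmembers():
                if m.isfile():
                    data = tf.extractfile(m).read()
                    try:
                        files[m.name] = data.decode()
                    except UnicodeDecodeError:
                        files[m.name] = data  # keep undecodable text as bytes
            return {**meta, "format": "tar", "files": files}
    except tarfile.TarError:
        pass

    # 2) zip archive
    try:
        with zipfile.ZipFile(io.BytesIO(raw)) as zf:
            return {**meta, "format": "zip",
                    "files": {n: zf.read(n) for n in zf.namelist()}}
    except zipfile.BadZipFile:
        pass

    # 3) JSON, then plain text, then opaque binary
    try:
        return {**meta, "format": "json", "content": json.loads(raw)}
    except (ValueError, UnicodeDecodeError):
        pass
    try:
        return {**meta, "format": "text", "content": raw.decode()}
    except UnicodeDecodeError:
        return {**meta, "format": "binary", "content": raw}
```

With datasets.load_dataset(..., streaming=True), each sample's task_binary field would be fed to parse_task() as it arrives, so nothing is materialized on disk.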

Use show_task() to preview a task: a breakdown by extension (e.g., .json, .py), with JSON truncated to 1500 characters and code to 600. Trade-off: streaming processes samples in real time but requires robust error handling for malformed blobs (e.g., on UnicodeDecodeError the content is kept as bytes).
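A preview helper in the spirit of show_task() might look like this; the truncation limits match the text, while the exact printed layout and the returned extension counter are invented for illustration:

```python
from collections import Counter
from pathlib import PurePosixPath

def show_task(parsed, json_limit=1500, code_limit=600):
    """Preview a parsed task: extension breakdown plus truncated contents."""
    files = parsed.get("files", {})
    # Count files per extension, e.g. {'.json': 1, '.py': 2}
    ext_counts = Counter(PurePosixPath(n).suffix or "(none)" for n in files)
    print(f"format={parsed['format']}  extensions={dict(ext_counts)}")
    for name, content in files.items():
        if isinstance(content, bytes):
            print(f"-- {name}: <{len(content)} binary bytes>")
            continue
        # JSON gets a longer preview window than code/text
        limit = json_limit if name.endswith(".json") else code_limit
        print(f"-- {name} --")
        print(content[:limit])
    return ext_counts  # handy for tests and quick inspection
```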

Uncover Dataset Structure via Counters and Plots

Extract source from the path prefix (split on the last '-'): the top 15 sources dominate the test split, with thousands of tasks each. Track compressed sizes: a log-scale histogram shows a median (p50) of some KB and a notably higher p95; most tasks are compact, with bulkier outliers. Inspecting 200 samples surfaces the common filenames (e.g., task.json and README.md top the counts) and frequent JSON keys (e.g., instruction, tests). Full listings typically reveal 5-10 files per tar/zip.
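Under the stated convention (source = everything before the last '-' in the path), the source counting and size profiling could be sketched as below; the sample dict fields and the nearest-rank p95 are assumptions:

```python
from collections import Counter
import statistics

def source_of(path):
    """Derive the source name from a task path (split on the last '-')."""
    return path.rsplit("-", 1)[0]

def profile(samples):
    """Count tasks per source and summarize compressed sizes (p50/p95)."""
    sources = Counter(source_of(s["path"]) for s in samples)
    sizes = sorted(s["compressed_size"] for s in samples)
    p50 = statistics.median(sizes)
    # Nearest-rank p95: the value at the 95th-percentile index
    p95 = sizes[min(len(sizes) - 1, int(0.95 * len(sizes)))]
    return sources, p50, p95
```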

Aggregate in TaskTroveExplorer.summary(limit=1000): group by source to get the task count n, mean compressed/raw KB (log-y bar chart of the top 12), and mean file count. This enables quick profiling; e.g., some sources average 10+ KB raw while others stay leaner. A Polars DataFrame slice of 500 tasks captures source, is_verified, sizes, and an instruction preview for downstream modeling.
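A pure-Python stand-in for the per-source group-by described above (the notebook reportedly uses Polars; the field names and return shape here are assumptions):

```python
from collections import defaultdict

def summary(tasks, limit=1000):
    """Group tasks by source: n, mean compressed/raw KB, mean file count."""
    groups = defaultdict(list)
    for t in tasks[:limit]:
        groups[t["source"]].append(t)
    rows = []
    for source, ts in groups.items():
        n = len(ts)
        rows.append({
            "source": source,
            "n": n,
            "mean_compressed_kb": sum(t["compressed_size"] for t in ts) / n / 1024,
            "mean_raw_kb": sum(t["raw_size"] for t in ts) / n / 1024,
            "mean_files": sum(t["n_files"] for t in ts) / n,
        })
    # Largest sources first, mirroring a top-12 bar chart ordering
    return sorted(rows, key=lambda r: -r["n"])
```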

Detect Verifiers and Export RL-Ready Tasks

Flag evaluation-ready tasks with has_verifier(): scan filenames for 'verifier'/'judge'/'grader', JSON keys like 'verifier_config'/'rubric'/'test_patch', or matching content strings. Combining signals boosts recall; e.g., verified tasks often carry a dedicated verifier.py or verifier JSON. Per-source rates vary (bar chart: green marks a high percentage usable for RL); hunt down the first verified sample to inspect it (e.g., a grader JSON with tests).
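The three signals above can be combined as in this sketch; the hint lists come from the text, while the function shape and the parsed-task layout it expects are assumptions:

```python
import json

VERIFIER_NAME_HINTS = ("verifier", "judge", "grader")
VERIFIER_KEY_HINTS = ("verifier_config", "rubric", "test_patch")

def has_verifier(parsed):
    """Multi-signal verifier detection: filenames, JSON keys, content strings."""
    files = parsed.get("files", {})
    # Signal 1: a dedicated verifier/judge/grader file
    for name in files:
        if any(h in name.lower() for h in VERIFIER_NAME_HINTS):
            return True
    # Signal 2: verifier-ish keys in any JSON file
    for name, content in files.items():
        if name.endswith(".json") and isinstance(content, str):
            try:
                obj = json.loads(content)
            except json.JSONDecodeError:
                continue
            if isinstance(obj, dict) and any(k in obj for k in VERIFIER_KEY_HINTS):
                return True
    # Signal 3: hint strings anywhere in text content
    for content in files.values():
        if isinstance(content, str) and any(
            h in content.lower() for h in VERIFIER_NAME_HINTS
        ):
            return True
    return False
```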

The TaskTroveExplorer class unifies the workflow: iter() filters by source, sample(n=5) parses and attaches metadata, and export() writes directories containing the extracted files and JSON. Saving a Parquet slice (500 rows, ~KB) speeds up later workflows by pre-filtering verified tasks (summed across sources). The full pipeline scales to the validation split and can list the HF repo subdirectories for all sources (~dozens).
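An export() step writing one parsed task to a directory might be sketched like this; the on-disk layout and fallback filenames are assumptions, not the class's documented behavior:

```python
import json
from pathlib import Path

def export(parsed, out_dir):
    """Write a parsed task to disk: archive members as files, else content.*."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    if "files" in parsed:
        for name, content in parsed["files"].items():
            dest = out / name
            dest.parent.mkdir(parents=True, exist_ok=True)  # nested tar paths
            if isinstance(content, bytes):
                dest.write_bytes(content)
            else:
                dest.write_text(content)
    else:
        content = parsed.get("content", b"")
        if isinstance(content, (dict, list)):
            (out / "content.json").write_text(json.dumps(content, indent=2))
        elif isinstance(content, bytes):
            (out / "content.bin").write_bytes(content)
        else:
            (out / "content.txt").write_text(str(content))
    return out
```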

Summarized by x-ai/grok-4.1-fast via openrouter

9713 input / 1943 output tokens in 26130ms

© 2026 Edge