Build Magika + OpenAI File Security Pipeline
Use Google's Magika for accurate byte-level file type detection and GPT-4o to generate security insights, risk scores, and reports—turning raw scans into actionable intelligence for uploads, forensics, and audits.
Initialize Tools for Byte-Level Detection
Start by installing magika and openai via !pip install magika openai -q. Securely input your OpenAI API key using getpass, initialize the OpenAI client, and verify the connection with client.models.list(). Load Magika with m = Magika() and check its capabilities: m.get_model_name(), m.get_module_version(), and supported labels via m.get_output_content_types(). This setup bypasses filename/extension reliance, using deep learning on raw bytes for robust detection—critical because extensions can be spoofed.
Define a reusable ask_gpt function for prompting:
def ask_gpt(system: str, user: str, model: str = "gpt-4o", max_tokens: int = 600) -> str:
    resp = client.chat.completions.create(
        model=model,
        max_tokens=max_tokens,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return resp.choices[0].message.content.strip()
This enables GPT to contextualize Magika outputs, e.g., explaining detection: "Explain how a deep-learning model detects file types from just bytes, and why this beats relying on file extensions."
Principle: Magika's model analyzes byte patterns (magic numbers, headers, statistical structure) and reports a single confidence score after thresholding. Raw dl.* fields show unprocessed model output; output.* fields are the finalized result (label, MIME type, group, extensions, is_text).
Common Mistake: Using outdated Magika APIs (e.g., MagikaConfig—nonexistent; use constructor Magika(prediction_mode=...); res.output_score → res.score).
Single and Batch Scanning with Project Inference
For single files: res = m.identify_bytes(raw_bytes), or res = m.identify_path(path) for a file on disk (m.identify_paths(list_of_paths) handles several at once). Extract res.output.label, res.score, and res.output.mime_type. Test on samples such as a Python shebang (#!/usr/bin/env python3) or ZIP magic bytes (0x50 0x4B 0x03 0x04); these yield labels like python and zip with scores above 90%.
Batch scan temp files:
import tempfile
from pathlib import Path

tmp_dir = Path(tempfile.mkdtemp())
file_specs = {"code.py": b"#!/usr/bin/env python3\n", "style.css": b"body{}", "data.json": b"{}"}
for fname, content in file_specs.items():
    (tmp_dir / fname).write_bytes(content)
paths = [tmp_dir / fname for fname in file_specs]
results = m.identify_paths(paths)
batch_summary = [{"file": p.name, "label": r.output.label, "group": r.output.group, "score": f"{r.score:.1%}"} for p, r in zip(paths, results)]
GPT infers project type: Prompt as DevSecOps expert to summarize codebase (e.g., web app with Python/JS/CSS/SQL) and flag scrutiny needs (e.g., shell scripts).
Quality Criteria: High scores (>95%) indicate reliable labels; group (e.g., text, archive) aids categorization. Use for repository audits.
Before/after: extension-based checks trust script.sh as shell even when its bytes are something else entirely; byte-based detection catches such spoofs.
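One way to hand the batch summary to GPT for project inference (file names and prompt wording are illustrative; ask_gpt is the helper defined earlier):

```python
import json

# Illustrative batch_summary rows, as produced by the scan above.
batch_summary = [
    {"file": "app.py", "label": "python", "group": "code", "score": "99.8%"},
    {"file": "style.css", "label": "css", "group": "code", "score": "99.5%"},
    {"file": "deploy.sh", "label": "shell", "group": "code", "score": "98.9%"},
]

system = "You are a DevSecOps expert reviewing a file-type scan of a repository."
user = (
    "Summarize what kind of codebase this is and flag any file types that "
    "deserve extra scrutiny:\n" + json.dumps(batch_summary, indent=2)
)
# project_summary = ask_gpt(system, user)  # uses the ask_gpt helper defined earlier
print(user.splitlines()[0])
```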
Manage Ambiguity with Prediction Modes and Result Inspection
Ambiguous inputs (e.g., plain text) vary by mode:
from magika import Magika, PredictionMode

for mode in [PredictionMode.HIGH_CONFIDENCE, PredictionMode.MEDIUM_CONFIDENCE, PredictionMode.BEST_GUESS]:
    m_mode = Magika(prediction_mode=mode)
    res = m_mode.identify_bytes(ambiguous_bytes)
HIGH_CONFIDENCE: Strict thresholding (e.g., text/plain only if >threshold); BEST_GUESS: More permissive.
GPT guidance: HIGH for blocking uploads (avoid false positives); MEDIUM for triage; BEST_GUESS for forensics.
Dissect MagikaResult:
- output.label: post-processed label (e.g., python)
- dl.label: raw model label (may differ pre-threshold)
- A single res.score applies to both.
Principle: Threshold logic refines raw predictions; inspect both for debugging. GPT clarifies: "dl.* are raw; output.* finalized—differences arise from confidence filters."
Exercise: probe increasing prefixes (4-512 bytes) of a Python script; the shebang's header pattern lets Magika detect python in under 32 bytes.
Detect Spoofs and Analyze Distributions for Threats
Spoof test:
for fname, content in spoofed_files.items():
    res = m.identify_bytes(content)
    detected = res.output.label
    match = detected == expected_from_ext  # label implied by the file extension
Flags mismatches (e.g., invoice.pdf → python; photo.jpg → html). GPT assesses: "Python-in-PDF: Likely webshell injection—quarantine and scan AV."
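The mismatch check reduces to a small helper; note the extension-to-label map here is an assumption for illustration, not part of Magika's API:

```python
import os

# Assumed mapping from extension to the label Magika would report for a
# genuine file of that type; extend as needed.
EXT_TO_LABEL = {".pdf": "pdf", ".jpg": "jpeg", ".py": "python", ".zip": "zip"}

def is_spoofed(filename: str, detected_label: str) -> bool:
    """True when the detected label contradicts what the extension claims."""
    expected = EXT_TO_LABEL.get(os.path.splitext(filename)[1].lower())
    return expected is not None and detected_label != expected

print(is_spoofed("invoice.pdf", "python"))  # True: a script posing as a PDF
print(is_spoofed("photo.jpg", "jpeg"))      # False: extension and bytes agree
```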
Corpus distribution: Scan mixed snippets (SQL, HTML, Python, etc.), count groups/labels with Counter. GPT infers: Polyglot repo (multi-lang); watch for unmaintained langs.
Trade-off: Magika excels on headers (few bytes) but needs full content for edge cases; pairs with GPT for semantic threat vectors.
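The Counter-based tally reads roughly like this (the corpus contents are illustrative):

```python
from collections import Counter

# Illustrative (label, group) pairs from scanning a mixed snippet corpus.
scans = [("python", "code"), ("html", "code"), ("sql", "code"),
         ("python", "code"), ("javascript", "code"), ("markdown", "text")]

label_counts = Counter(label for label, _ in scans)
group_counts = Counter(group for _, group in scans)
print(label_counts.most_common(2))
print(group_counts)
```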
Build Upload Pipeline with Risk-Based Decisions
Simulate uploads:
upload_dir = Path(tempfile.mkdtemp()) / "uploads"
upload_dir.mkdir()
# Write uploads into upload_dir: report.pdf, malware.exe, etc.
all_paths = list(upload_dir.iterdir())
batch_results = m.identify_paths(all_paths)
BLOCKED_LABELS = {"pe", "elf", "macho"}  # native executable formats
for path, res in zip(all_paths, batch_results):
    status = "🚫 BLOCKED" if res.output.label in BLOCKED_LABELS else "✅ OK"  # or flag ext mismatch
GPT risk score: Identifies malware.exe (PE binary), suspicious.txt (MZ header)—recommend sandbox/AV scan.
Forensics: Hash prefixes (hashlib.sha256), log MIME/is_text. GPT crafts IOC narrative: "Sample_E (MZ): PE dropper in attack chain—hash for threat intel feeds."
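Prefix hashing for the forensic log might look like this (the sample bytes are fabricated for illustration):

```python
import hashlib

# Illustrative payload with an MZ (PE-style) header.
sample = b"MZ\x90\x00" + b"\x00" * 124

full_hash = hashlib.sha256(sample).hexdigest()
prefix_hash = hashlib.sha256(sample[:64]).hexdigest()  # hash of the scanned prefix
print("full:  ", full_hash[:16])
print("prefix:", prefix_hash[:16])
```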
Principle: Combine type/group with extension checks; block executables outright.
Generate Structured Reports and Executive Insights
Compile JSON:
import json

report = [
    {
        "filename": p.name,
        "label": r.output.label,
        "description": r.output.description,
        "mime_type": r.output.mime_type,
        # ... plus score, dl_label, etc.
    }
    for p, r in zip(all_paths, batch_results)
]
with open("/tmp/report.json", "w") as f:
    json.dump({"scan_results": report, "exec_summary": exec_summary}, f, indent=2)
GPT as CISO: Paragraph 1: Findings/risk (e.g., "Two spoofs, one binary—medium risk."); Paragraph 2: Steps ("Re-scan, update policies").
Template: Export includes raw + interpreted data for audits.
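An end-to-end export sketch following the schema above (records and MIME values are illustrative, and a tempfile path stands in for /tmp for portability):

```python
import json
import os
import tempfile

scan_results = [  # illustrative per-file records mirroring the report schema
    {"filename": "report.pdf", "label": "pdf", "score": 0.999},
    {"filename": "malware.exe", "label": "pe", "score": 0.998},
]
exec_summary = "Placeholder; in the pipeline this comes from ask_gpt acting as CISO."

out_path = os.path.join(tempfile.mkdtemp(), "report.json")
with open(out_path, "w") as f:
    json.dump({"scan_results": scan_results, "exec_summary": exec_summary}, f, indent=2)
print("wrote", out_path)
```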
Quotes:
- GPT on Magika: "A deep-learning model detects file types from bytes by learning magic numbers, headers, and statistical patterns—far superior to extensions, which attackers spoof easily." (Core API explanation)
- GPT on modes: "HIGH_CONFIDENCE for production uploads to minimize false positives; MEDIUM for batch triage; BEST_GUESS for exploratory forensics." (Mode guidance)
- GPT threat: "data.csv as ZIP: Archive bomb potential—extract safely in sandbox before processing." (Spoof assessment)
- GPT risk: "Highest-risk: malware.exe (PE executable)—block and alert; spoof.pdf (Python script)—potential RCE via inclusion." (Upload pipeline)
- GPT exec: "Overall risk posture: Moderate due to binaries and spoofs; no immediate breach but policy gaps exposed." (Summary)
Key Takeaways
- Install Magika/OpenAI, init with API key; use identify_bytes/identify_paths for extension-agnostic detection.
- Batch scan directories; Counter over groups/labels to infer repo types via GPT.
- Tune prediction_mode per use case: HIGH_CONFIDENCE for security gates, BEST_GUESS for analysis.
- Flag spoofs (detected != extension) and block binaries (pe/elf/macho); use GPT for threat narratives.
- Probe minimal byte prefixes (often <64 bytes); detection leverages header patterns.
- Export JSON with output.* + dl.* + GPT summaries for forensics/audits.
- Always inspect MagikaResult.score (>90% is reliable); pair with hashing for IOCs.
- Avoid old APIs: pass modes to the constructor, and read the single res.score.
- Practice: build an upload handler integrating this pipeline in Flask/FastAPI.
- Scale: Corpus analysis reveals maintainability risks (e.g., too many langs).