Build Magika + OpenAI File Security Pipeline
Use Google's Magika for accurate byte-level file type detection and GPT-4o to generate security insights, risk scores, and reports—turning raw scans into actionable intelligence for uploads, forensics, and audits.
Initialize Tools for Byte-Level Detection
Start by installing magika and openai via !pip install magika openai -q. Securely input your OpenAI API key using getpass, initialize the OpenAI client, and verify the connection with client.models.list(). Load Magika with m = Magika() and check its capabilities: m.get_model_name(), m.get_module_version(), and supported labels via m.get_output_content_types(). This setup bypasses filename/extension reliance, using deep learning on raw bytes for robust detection—critical because extensions can be spoofed.
Define a reusable ask_gpt function for prompting:
def ask_gpt(system: str, user: str, model: str = "gpt-4o", max_tokens: int = 600) -> str:
    resp = client.chat.completions.create(
        model=model,
        max_tokens=max_tokens,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return resp.choices[0].message.content.strip()
This enables GPT to contextualize Magika outputs, e.g., explaining detection: "Explain how a deep-learning model detects file types from just bytes, and why this beats relying on file extensions."
Principle: Magika's model analyzes byte patterns (magic numbers, headers, statistical structure) and reports a single confidence score after thresholding. Raw dl.* fields show unprocessed model output; output.* fields are the finalized result (label, MIME type, group, extensions, is_text).
Common Mistake: Using outdated Magika APIs (e.g., MagikaConfig—nonexistent; use constructor Magika(prediction_mode=...); res.output_score → res.score).
Single and Batch Scanning with Project Inference
For single files: res = m.identify_bytes(raw_bytes), or res = m.identify_path(path) for a file on disk (m.identify_paths(list_of_paths) handles several at once). Extract res.output.label, res.score, and res.output.mime_type. Test on samples such as a Python shebang (#!/usr/bin/env python3) or ZIP magic bytes (0x50 0x4B 0x03 0x04); these yield labels like python and zip with scores above 90%.
Batch scan temp files:
import tempfile
from pathlib import Path

tmp_dir = Path(tempfile.mkdtemp())
file_specs = {"code.py": b"#!/usr/bin/env python3\n", "style.css": b"body{}", "data.json": b"{}"}
for fname, content in file_specs.items():
    (tmp_dir / fname).write_bytes(content)
paths = [tmp_dir / fname for fname in file_specs]
results = m.identify_paths(paths)
batch_summary = [{"file": p.name, "label": r.output.label, "group": r.output.group, "score": f"{r.score:.1%}"} for p, r in zip(paths, results)]
GPT infers project type: Prompt as DevSecOps expert to summarize codebase (e.g., web app with Python/JS/CSS/SQL) and flag scrutiny needs (e.g., shell scripts).
Quality Criteria: High scores (>95%) indicate reliable labels; group (e.g., text, archive) aids categorization. Use for repository audits.
Before/after: extension-based checks trust script.sh as shell even when its bytes are something else entirely; byte-based detection catches such spoofs.
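One way to hand the batch summary to GPT for project inference (file names and prompt wording are illustrative; ask_gpt is the helper defined earlier):

```python
import json

# Illustrative batch_summary rows, as produced by the scan above.
batch_summary = [
    {"file": "app.py", "label": "python", "group": "code", "score": "99.8%"},
    {"file": "style.css", "label": "css", "group": "code", "score": "99.5%"},
    {"file": "deploy.sh", "label": "shell", "group": "code", "score": "98.9%"},
]

system = "You are a DevSecOps expert reviewing a file-type scan of a repository."
user = (
    "Summarize what kind of codebase this is and flag any file types that "
    "deserve extra scrutiny:\n" + json.dumps(batch_summary, indent=2)
)
# project_summary = ask_gpt(system, user)  # uses the ask_gpt helper defined earlier
print(user.splitlines()[0])
```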
Manage Ambiguity with Prediction Modes and Result Inspection
Ambiguous inputs (e.g., plain text) vary by mode:
from magika import Magika, PredictionMode

for mode in [PredictionMode.HIGH_CONFIDENCE, PredictionMode.MEDIUM_CONFIDENCE, PredictionMode.BEST_GUESS]:
    m_mode = Magika(prediction_mode=mode)
    res = m_mode.identify_bytes(ambiguous_bytes)
HIGH_CONFIDENCE: Strict thresholding (e.g., text/plain only if >threshold); BEST_GUESS: More permissive.
GPT guidance: HIGH for blocking uploads (avoid false positives); MEDIUM for triage; BEST_GUESS for forensics.
Dissect MagikaResult:
- output.label: post-processed label (e.g., python)
- dl.label: raw model label (may differ pre-threshold)
- A single res.score applies to both.
Principle: Threshold logic refines raw predictions; inspect both for debugging. GPT clarifies: "dl.* are raw; output.* finalized—differences arise from confidence filters."
Exercise: probe increasing prefixes (4-512 bytes) of a Python script; the shebang's header pattern lets Magika detect python in under 32 bytes.
Detect Spoofs and Analyze Distributions for Threats
Spoof test:
for fname, content in spoofed_files.items():
    res = m.identify_bytes(content)
    detected = res.output.label
    match = detected == expected_from_ext  # label implied by the file extension
Flags mismatches (e.g., invoice.pdf → python; photo.jpg → html). GPT assesses: "Python-in-PDF: Likely webshell injection—quarantine and scan AV."
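The mismatch check reduces to a small helper; note the extension-to-label map here is an assumption for illustration, not part of Magika's API:

```python
import os

# Assumed mapping from extension to the label Magika would report for a
# genuine file of that type; extend as needed.
EXT_TO_LABEL = {".pdf": "pdf", ".jpg": "jpeg", ".py": "python", ".zip": "zip"}

def is_spoofed(filename: str, detected_label: str) -> bool:
    """True when the detected label contradicts what the extension claims."""
    expected = EXT_TO_LABEL.get(os.path.splitext(filename)[1].lower())
    return expected is not None and detected_label != expected

print(is_spoofed("invoice.pdf", "python"))  # True: a script posing as a PDF
print(is_spoofed("photo.jpg", "jpeg"))      # False: extension and bytes agree
```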
Corpus distribution: Scan mixed snippets (SQL, HTML, Python, etc.), count groups/labels with Counter. GPT infers: Polyglot repo (multi-lang); watch for unmaintained langs.
Trade-off: Magika excels on headers (few bytes) but needs full content for edge cases; pairs with GPT for semantic threat vectors.
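The Counter-based tally reads roughly like this (the corpus contents are illustrative):

```python
from collections import Counter

# Illustrative (label, group) pairs from scanning a mixed snippet corpus.
scans = [("python", "code"), ("html", "code"), ("sql", "code"),
         ("python", "code"), ("javascript", "code"), ("markdown", "text")]

label_counts = Counter(label for label, _ in scans)
group_counts = Counter(group for _, group in scans)
print(label_counts.most_common(2))
print(group_counts)
```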
Build Upload Pipeline with Risk-Based Decisions
Simulate uploads:
upload_dir = Path(tempfile.mkdtemp()) / "uploads"
upload_dir.mkdir()
# Write uploads into upload_dir: report.pdf, malware.exe, etc.
all_paths = list(upload_dir.iterdir())
batch_results = m.identify_paths(all_paths)
BLOCKED_LABELS = {"pe", "elf", "macho"}  # native executable formats
for path, res in zip(all_paths, batch_results):
    status = "🚫 BLOCKED" if res.output.label in BLOCKED_LABELS else "✅ OK"  # or flag ext mismatch
GPT risk score: Identifies malware.exe (PE binary), suspicious.txt (MZ header)—recommend sandbox/AV scan.
Forensics: Hash prefixes (hashlib.sha256), log MIME/is_text. GPT crafts IOC narrative: "Sample_E (MZ): PE dropper in attack chain—hash for threat intel feeds."
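Prefix hashing for the forensic log might look like this (the sample bytes are fabricated for illustration):

```python
import hashlib

# Illustrative payload with an MZ (PE-style) header.
sample = b"MZ\x90\x00" + b"\x00" * 124

full_hash = hashlib.sha256(sample).hexdigest()
prefix_hash = hashlib.sha256(sample[:64]).hexdigest()  # hash of the scanned prefix
print("full:  ", full_hash[:16])
print("prefix:", prefix_hash[:16])
```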
Principle: Combine type/group with extension checks; block executables outright.
Generate Structured Reports and Executive Insights
Compile JSON:
import json

report = [
    {
        "filename": p.name,
        "label": r.output.label,
        "description": r.output.description,
        "mime_type": r.output.mime_type,
        # ... plus score, dl_label, etc.
    }
    for p, r in zip(all_paths, batch_results)
]
with open("/tmp/report.json", "w") as f:
    json.dump({"scan_results": report, "exec_summary": exec_summary}, f, indent=2)
GPT as CISO: Paragraph 1: Findings/risk (e.g., "Two spoofs, one binary—medium risk."); Paragraph 2: Steps ("Re-scan, update policies").
Template: Export includes raw + interpreted data for audits.
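An end-to-end export sketch following the schema above (records and MIME values are illustrative, and a tempfile path stands in for /tmp for portability):

```python
import json
import os
import tempfile

scan_results = [  # illustrative per-file records mirroring the report schema
    {"filename": "report.pdf", "label": "pdf", "score": 0.999},
    {"filename": "malware.exe", "label": "pe", "score": 0.998},
]
exec_summary = "Placeholder; in the pipeline this comes from ask_gpt acting as CISO."

out_path = os.path.join(tempfile.mkdtemp(), "report.json")
with open(out_path, "w") as f:
    json.dump({"scan_results": scan_results, "exec_summary": exec_summary}, f, indent=2)
print("wrote", out_path)
```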
Quotes:
- GPT on Magika: "A deep-learning model detects file types from bytes by learning magic numbers, headers, and statistical patterns—far superior to extensions, which attackers spoof easily." (Core API explanation)
- GPT on modes: "HIGH_CONFIDENCE for production uploads to minimize false positives; MEDIUM for batch triage; BEST_GUESS for exploratory forensics." (Mode guidance)
- GPT threat: "data.csv as ZIP: Archive bomb potential—extract safely in sandbox before processing." (Spoof assessment)
- GPT risk: "Highest-risk: malware.exe (PE executable)—block and alert; spoof.pdf (Python script)—potential RCE via inclusion." (Upload pipeline)
- GPT exec: "Overall risk posture: Moderate due to binaries and spoofs; no immediate breach but policy gaps exposed." (Summary)
Key Takeaways
- Install Magika/OpenAI, init with API key; use identify_bytes/identify_paths for extension-agnostic detection.
- Batch scan directories; Counter over groups/labels to infer repo types via GPT.
- Tune prediction_mode per use case: HIGH_CONFIDENCE for security gates, BEST_GUESS for analysis.
- Flag spoofs (detected != extension) and block binaries (pe/elf/macho); use GPT for threat narratives.
- Probe minimal byte prefixes (often <64 bytes); detection leverages header patterns.
- Export JSON with output.* + dl.* + GPT summaries for forensics/audits.
- Always inspect MagikaResult.score (>90% is reliable); pair with hashing for IOCs.
- Avoid old APIs: pass modes to the constructor, and read the single res.score.
- Practice: build an upload handler integrating this pipeline in Flask/FastAPI.
- Scale: Corpus analysis reveals maintainability risks (e.g., too many langs).