Inference & Serving
All things Inference & Serving on Edge.
SpaceX's Neocloud and the Rise of Owned Intelligence
SpaceX is emerging as a massive compute provider with $28B in annualized GPU rental deals, while developers increasingly prioritize 'owned intelligence' via open-weight models like GLM-5.2 to gain control over their AI stacks.
Porting PyTorch Models to the Browser with Claude Code
By leveraging Claude Code to convert PyTorch models to ONNX, developers can run sophisticated AI features like image inpainting directly in the browser using WebGPU and the CacheStorage API.
ParallelKernelBench: Frontier LLMs Struggle with Multi-GPU Kernels
While LLMs excel at single-GPU kernel generation, they currently struggle with multi-GPU tasks where communication bottlenecks and complex rank coordination dominate performance.
Deploying vLLM Endpoints on Hugging Face Jobs
Hugging Face Jobs allows engineers to spin up private, OpenAI-compatible vLLM endpoints on demand using a single command, providing a pay-per-second alternative for testing and experimentation.
Prototype Big, Deploy Small: A Framework for Local LLM Adoption
Stop overpaying for frontier models. By using a 'prototype big, deploy small' framework and rigorous capability evals, you can identify 'Sage' (Small and Good Enough) models that provide production-grade performance on-device, saving costs and improving latency.
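The selection step in this framework can be sketched roughly as follows. This is a minimal illustration, not the article's implementation: the `Candidate` type, the `pick_sage` helper, the `tolerance` parameter, and all model names and scores are hypothetical, and it assumes each model has already been run through the same capability eval to produce a pass rate between 0 and 1.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """A model that was scored on the same capability eval (hypothetical type)."""
    name: str
    params_b: float    # parameter count in billions
    eval_score: float  # fraction of eval cases passed, 0..1


def pick_sage(
    candidates: Iterable[Candidate],
    baseline_score: float,
    tolerance: float = 0.05,
) -> Optional[Candidate]:
    """Return the smallest candidate scoring within `tolerance` of the baseline.

    `baseline_score` is the pass rate of the large model used for prototyping.
    Returns None when no candidate is good enough.
    """
    threshold = baseline_score - tolerance
    good_enough = [c for c in candidates if c.eval_score >= threshold]
    return min(good_enough, key=lambda c: c.params_b, default=None)


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    pool = [
        Candidate("frontier-large", 400.0, 0.92),
        Candidate("mid-30b", 30.0, 0.90),
        Candidate("small-4b", 4.0, 0.88),
        Candidate("tiny-1b", 1.0, 0.70),
    ]
    choice = pick_sage(pool, baseline_score=0.92)
    print(choice.name if choice else "no candidate is good enough")
```

Picking by parameter count is only one proxy for deployment cost; on-device memory footprint or measured latency could be substituted as the `min` key.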
Thermodynamic Computing and the Future of AI-Driven Chip Design
Thomas Ahle of Normal Computing discusses using AI agents to automate chip design, the risks of 'understanding debt' in agentic code, and the development of thermodynamic chips that use physical noise to perform stochastic computations.
Machine Learning Street Talk
OpenAI and Broadcom Unveil Jalapeño Inference Chip
OpenAI and Broadcom have developed 'Jalapeño,' a custom ASIC designed specifically for LLM inference, aiming to improve performance-per-watt and reduce latency through hardware-software co-design.
The Strategic Shift Toward Custom AI Silicon
Major tech players are developing custom chips to mitigate single-supplier risk, optimize hardware for specific workloads, and achieve performance gains similar to those Apple saw after its transition away from Intel.
Scaling Beyond 2D: IBM’s Nano Stack and the Rise of Orchestration
IBM introduces a 0.7nm 'nano stack' chip architecture to overcome 2D scaling limits, while the panel debates the shift from monolithic model development to multi-model orchestration as the new frontier for AI performance.
Scaling AI Agents and Inference on Google Cloud Run
Google Cloud Run is evolving from a web-service platform into a comprehensive runtime for AI agents, inference, and background tasks, introducing features like GPU support, sandboxed code execution, and custom scaling controls.
Google Cloud Tech
Optimizing Browser AI with Cross-Origin Storage
The proposed Cross-Origin Storage (COS) API allows web apps to share large AI model and Wasm files across different origins using cryptographic hashes, eliminating redundant downloads and storage.
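The idea of keying shared files by cryptographic hash can be illustrated with a small content-addressed store. This is a conceptual sketch only and does not reproduce the COS API surface: the `HashKeyedStore` class, its `get` method, and the `fetch` callback are hypothetical names, and SHA-256 is assumed as the hash function.

```python
import hashlib
from typing import Callable, Dict


class HashKeyedStore:
    """Toy content-addressed store shared by several 'origins' (hypothetical)."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}
        self.downloads = 0  # how many times fetch() actually ran

    def get(self, expected_sha256: str, fetch: Callable[[], bytes]) -> bytes:
        """Return the blob with this hash, downloading it only on a miss."""
        cached = self._blobs.get(expected_sha256)
        if cached is not None:
            return cached  # another origin already stored the same bytes

        data = fetch()
        self.downloads += 1

        # Verify the content before sharing it, so a caller cannot
        # poison the store with bytes that do not match the hash.
        actual = hashlib.sha256(data).hexdigest()
        if actual != expected_sha256:
            raise ValueError("downloaded content does not match expected hash")

        self._blobs[expected_sha256] = data
        return data


if __name__ == "__main__":
    weights = b"pretend these are model weights"
    digest = hashlib.sha256(weights).hexdigest()

    store = HashKeyedStore()
    store.get(digest, lambda: weights)  # first app: downloads and stores
    store.get(digest, lambda: weights)  # second app: served from the store
    print(store.downloads)
```

Because the key is derived from the bytes themselves, two apps that request the same model file resolve to the same entry regardless of which URL each would have downloaded it from.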