Summaries · Inference & Serving

DAY 01Today JUN 29 · 20265 SUMMARIES

Latent Space (Newsletter)Inference & ServingJun 29, 2026

SpaceX's Neocloud and the Rise of Owned Intelligence

SpaceX is emerging as a massive compute provider with $28B/year in annualized GPU rental deals, while developers increasingly prioritize 'owned intelligence' via open-weight models like GLM-5.2 to gain control over their AI stacks.

Latent Space (Newsletter)

Simon Willison's WeblogInference & ServingJun 29, 2026

Porting PyTorch Models to the Browser with Claude Code

By leveraging Claude Code to convert PyTorch models to ONNX, developers can run sophisticated AI features like image inpainting directly in the browser using WebGPU and the CacheStorage API.

Together AI BlogInference & ServingJun 29, 2026

ParallelKernelBench: Frontier LLMs Struggle with Multi-GPU Kernels

While LLMs excel at single-GPU kernel generation, they currently struggle with multi-GPU tasks where communication bottlenecks and complex rank coordination dominate performance.

Hugging Face BlogInference & ServingJun 29, 2026

Deploying vLLM Endpoints on Hugging Face Jobs

Hugging Face Jobs allows engineers to spin up private, OpenAI-compatible vLLM endpoints on demand using a single command, providing a pay-per-second alternative for testing and experimentation.

AI EngineerInference & ServingJun 29, 2026

Prototype Big, Deploy Small: A Framework for Local LLM Adoption

Stop overpaying for frontier models. By using a 'prototype big, deploy small' framework and rigorous capability evals, you can identify 'Sage' (Small and Good Enough) models that provide production-grade performance on-device, saving costs and improving latency.

DAY 02Yesterday JUN 28 · 20262 SUMMARIES

Machine Learning Street TalkInference & ServingJun 28, 2026

Thermodynamic Computing and the Future of AI-Driven Chip Design

Thomas Ahle of Normal Computing discusses using AI agents to automate chip design, the risks of 'understanding debt' in agentic code, and the development of thermodynamic chips that use physical noise to perform stochastic computations.

Machine Learning Street Talk

OpenAI NewsInference & ServingJun 28, 2026

OpenAI and Broadcom Unveil Jalapeño Inference Chip

OpenAI and Broadcom have developed 'Jalapeño,' a custom ASIC designed specifically for LLM inference, aiming to improve performance-per-watt and reduce latency through hardware-software co-design.

DAY 03Friday JUN 26 · 20262 SUMMARIES

TechCrunch — AIInference & ServingJun 26, 2026

The Strategic Shift Toward Custom AI Silicon

Major tech players are developing custom chips to mitigate single-supplier risk, optimize hardware for specific workloads, and achieve performance gains similar to Apple's transition away from Intel.

TechCrunch — AI

IBM TechnologyInference & ServingJun 26, 2026

Scaling Beyond 2D: IBM’s Nano Stack and the Rise of Orchestration

IBM introduces a 0.7nm 'nano stack' chip architecture to overcome 2D scaling limits, while the panel debates the shift from monolithic model development to multi-model orchestration as the new frontier for AI performance.

DAY 04Thursday JUN 25 · 20261 SUMMARIES

Google Cloud TechInference & ServingJun 25, 2026

Scaling AI Agents and Inference on Google Cloud Run

Google Cloud Run is evolving from a web-service platform into a comprehensive runtime for AI agents, inference, and background tasks, introducing features like GPU support, sandboxed code execution, and custom scaling controls.

Google Cloud Tech

DAY 05June 30, 2025 JUN 30 · 20251 SUMMARIES

Hugging Face BlogInference & ServingJun 30, 2025

Optimizing Browser AI with Cross-Origin Storage

The proposed Cross-Origin Storage (COS) API allows web apps to share large AI model and Wasm files across different origins using cryptographic hashes, eliminating redundant downloads and storage.

Hugging Face Blog