The Emergence of SpaceX as a 'Neocloud'

SpaceX has rapidly scaled into a major compute provider, securing massive GPU rental agreements that position it as a critical infrastructure layer. Recent reports indicate a $6.3 billion deal with Reflection AI for GB300 access, following similar large-scale contracts with Anthropic and Google. These deals total approximately $2.32 billion per month, which annualizes to roughly $28 billion in revenue ($2.32 billion × 12 ≈ $27.8 billion), about double the current revenue of CoreWeave. This 'neocloud' model, characterized by high-density Blackwell deployments and large-scale compute leasing, suggests that GPU brokerage is becoming a strategic market in its own right, sitting between hardware supply and frontier model builders.
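The annualized figure above is a simple run-rate calculation. A minimal sketch of that arithmetic follows, assuming the quoted monthly total stays constant for twelve months; the constant and function names are illustrative and the dollar amount is the one quoted in the text, not independently verified.

```python
# Monthly contract total as quoted in the text above (an unverified, reported figure).
MONTHLY_CONTRACT_REVENUE_USD = 2.32e9


def annualize(monthly_usd: float) -> float:
    """Naive run-rate: assumes every month matches the quoted month."""
    return monthly_usd * 12


annual_run_rate_usd = annualize(MONTHLY_CONTRACT_REVENUE_USD)
print(f"Annualized run-rate: ${annual_run_rate_usd / 1e9:.2f}B")  # about $27.84B
```

A run-rate like this says nothing about contract length or ramp schedules, so it is best read as an upper-level estimate rather than booked revenue.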

The Shift Toward 'Owned Intelligence'

There is a clear trend among enterprise application builders (such as Cursor, Notion, and Harvey) to move away from relying solely on third-party APIs and toward 'owned intelligence.' This involves running open-weight or specialized models that companies can post-train on their own data and evals. Baseten’s recent $1.5B Series F funding highlights this shift, as enterprises seek to retain control over their model stack to ensure reliability and continual learning.

GLM-5.2 has emerged as the leading open-weight model for these agentic workflows. Benchmarks and real-world harness runs (such as those performed in the Cline repository) show it is competitive with proprietary frontier models. While it may be slower and make more tool calls than closed alternatives, it offers significant cost advantages ($0.41 vs $0.81 in specific bug-fixing tasks, roughly half the cost) and robust verification capabilities. Adoption has been fast: GLM-5.2 has already been integrated into AWS Marketplace, Baseten, and various agentic frameworks.

Evolving Agentic Infrastructure

Agent engineering is moving beyond simple chat interfaces toward stateful, tool-rich, long-running workflows. Google now promotes the 'Interactions API' as the default for Gemini agents, which reflects this shift: the API offers background asynchronous execution and remote sandboxing (Antigravity). Simultaneously, the developer community is standardizing around agent communication protocols and harness ergonomics, as seen in the growth of projects like Hermes.

However, this shift has exposed weaknesses in current evaluation methodologies. Audit data across 21 judges and 541K judgments indicates that 'exact-match' agreement significantly overstates judge quality: chance-corrected metrics such as Cohen’s kappa, which discount the agreement two raters would reach by chance alone, reveal much lower agreement. The industry is increasingly recognizing that evaluating agents requires testing system behavior under tools, memory, and long-horizon execution, rather than relying on static, single-turn benchmark scores.
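To illustrate why exact-match agreement can flatter a judge, here is a minimal sketch comparing raw agreement with Cohen's kappa on a toy, imbalanced label set. The labels and counts are invented purely for illustration and are not taken from the audit; the function names are likewise hypothetical.

```python
from collections import Counter


def exact_match(a: list[str], b: list[str]) -> float:
    """Fraction of items on which the two raters give the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), with p_e the chance agreement."""
    n = len(a)
    p_o = exact_match(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label independently.
    p_e = sum(counts_a[label] * counts_b[label] for label in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy data (invented): 90 "pass" and 10 "fail" reference labels.
human = ["pass"] * 90 + ["fail"] * 10
# A judge that says "pass" almost always: right on 88 passes and 2 fails.
judge = ["pass"] * 88 + ["fail"] * 2 + ["pass"] * 8 + ["fail"] * 2

agreement = exact_match(human, judge)
kappa = cohens_kappa(human, judge)
print(f"exact-match agreement: {agreement:.2f}")  # 0.90
print(f"Cohen's kappa:         {kappa:.2f}")      # 0.24
```

In this toy case the judge agrees with the reference labels 90% of the time, yet most of that agreement is what a rater who nearly always answers "pass" would get by chance on a 90/10 split, so kappa comes out near 0.24.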