In the last 5 years, we’ve seen a dramatic shift — AI has gone from being an accelerator or add-on to a utility. We went from classification and prediction models that enrich your social feed, shopping apps, maps, and driving experience to vibe coding and agents, which are fundamentally new applications with AI foundation models (LLMs) as their lifeblood. So, if AI is now a utility, is it possible to build not only applications but also infrastructure with it? And if so, how do we go about it?
I think of LLMs as a new type of compute platform that you program with English. The program and the data that an LLM operates on at any point in time are limited to its context. Context, therefore, is like main memory — it’s precious and should be used wisely. More importantly, programming LLMs today is like programming computers in the 1960s. It’s the Wild West. We are still experimenting with the abstractions to make the software built on LLMs reliable, reusable, and scalable.
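To make the main-memory analogy concrete, here is a minimal sketch of a context packer that treats the window as a fixed token budget. The window size, item priorities, and token counts are all illustrative assumptions, not any real model’s API:

```python
# Sketch: the context window as a fixed "main memory" budget.
# WINDOW is an assumed size; real models vary widely.
WINDOW = 128_000  # tokens

def pack_context(items, budget=WINDOW):
    """Greedily pack (priority, tokens, text) items, highest priority first."""
    used, packed = 0, []
    for priority, tokens, text in sorted(items, reverse=True):
        if used + tokens <= budget:
            packed.append(text)
            used += tokens
    return packed

# Example: the system prompt and task always fit; bulk inputs compete
# for whatever budget remains.
items = [
    (3, 500, "system prompt"),
    (3, 2_000, "task description"),
    (2, 90_000, "retrieved documents"),
    (1, 60_000, "conversation history"),  # dropped: would exceed the budget
]
print(pack_context(items))
```

The point is not this particular policy — it’s that every framework for programming LLMs ends up making some version of this allocation decision, explicitly or not.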
This perspective is not new — Andrej Karpathy speaks about Software 3.0, LLMs as an operating system for multiplexing tools, and context engineering in lieu of prompt engineering. He asks how best to orchestrate the context. Joe Hellerstein calls LLMs the new microprocessor and asks what software abstractions will be built on top of them. To answer these questions, we must first see how AI is scaling.
Inference-time scaling is all the rage. The race to make AI smarter is no longer about more data and more parameters. We’re running out of data and the marginal benefits of making models bigger have hit diminishing returns. So, instead we’re using “thinking” to guide models into taking the right steps, and reinforcement learning to unearth latent capabilities in the models. Both of these stress inference speed, so all the energy is going there.
Inference speed is getting faster at nearly 10x per year for a given class of model — from ~15 tokens/sec in 2023, to ~150 tokens/sec in 2024, to ~1500 tokens/sec in 2025, and ~3000 tokens/sec in early 2026. We’re building processors optimized for inference like Google’s TPUs, Groq’s LPUs, and Cerebras’ WSE, as well as optimizing software, to drive this unprecedented growth rate.
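As a quick sanity check, the year-over-year multiples implied by these approximate figures can be computed directly (the 2026 number is an early-year snapshot, hence the smaller multiple):

```python
# Approximate throughput figures from the text (tokens/sec).
speeds = {2023: 15, 2024: 150, 2025: 1500, 2026: 3000}

years = sorted(speeds)
for a, b in zip(years, years[1:]):
    print(f"{a} -> {b}: {speeds[b] / speeds[a]:.0f}x")
```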
So, what does this all mean for infrastructure like databases? I argue it has three implications:
First, the most prosaic implication is that we’ll need to build databases for LLMs. Instead of 1-2 queries/sec per user, agents today can generate ~100 queries/sec/user, and they will only get faster. DBs will need to keep up.
The second and more interesting implication is that we will likely build databases with LLMs. You may think that vibe coding is for apps and perhaps pure mathematics, but not for serious stuff like databases. I argue that we will not only refactor large components of industrial-strength databases, but also automatically generate purpose-built databases for every application. This is like self-driving databases on steroids. LLMs will be able to analyze an application’s code and workload and, for example, automatically compile the most efficient data structures for its queries and the minimal coordination mechanisms needed for consistency and reliability.
Finally, we will build all of our analytics engines (i.e., data warehouses) on LLMs, and unstructured data will be first-class. To be clear, we’re not talking about bolting LLMs onto existing architectures. It’s more fundamental — LLMs will be the compute substrate that analytics engines are built upon. By 2036, frontier models will offer processing speeds of 1.5B+ tokens/sec. At this rate, you can analyze all of Shakespeare in under 1 ms, English Wikipedia in under 5 secs, and all public PDFs in under 2 mins. I argue that today’s agent-based “answer engines” are built wrong for this future. Instead of pre-processing to build complex vector indexes or knowledge graphs, which are inaccurate and hard to maintain, they should focus on scanning and sifting through all relevant data with LLMs at query time, à la map-reduce. They should build the context necessary just-in-time to answer questions or complete a task accurately.
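A back-of-envelope check of those 2036 numbers, using rough, assumed corpus sizes in tokens (the token counts here are my estimates, not measured figures):

```python
# Projected frontier throughput from the text.
RATE = 1.5e9  # tokens/sec

# Rough corpus-size assumptions: Shakespeare ~900K words (~1.2M tokens),
# English Wikipedia ~4.5B words (~6B tokens), and a loose
# order-of-magnitude guess for all public PDFs.
corpora = {
    "Shakespeare (complete works)": 1.2e6,
    "English Wikipedia": 6.0e9,
    "All public PDFs": 1.5e11,
}
for name, tokens in corpora.items():
    print(f"{name}: {tokens / RATE:.4f} sec")
```

Under these assumptions, Shakespeare scans in well under a millisecond, Wikipedia in about 4 seconds, and the PDF corpus in under 2 minutes — consistent with the claims above.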
Inference speed is the new gravity. It will drive all database and analytics architectural decisions for the next decade. There’s no use in fighting it. So, let’s get started.