Subquadratic Says SubQ Breaks the LLM Attention Bottleneck

June 20, 2026 | News

A Big Claim About LLM Efficiency

Miami-based startup Subquadratic says its new model, SubQ, attacks one of the most expensive problems in modern language models: the quadratic scaling cost of attention.

Most frontier LLMs rely on dense attention, where every token is compared against every other token in the context window. That approach is powerful, but it becomes brutally expensive as context grows. Because the work grows with the square of the context length, doubling the amount of text pushes the required attention computation up by roughly four times, which is one reason long-context models are costly, slow, and energy-hungry.
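To make the scaling concrete, here is a minimal sketch that counts the query-key comparisons in full dense attention. The function names are illustrative, and the count ignores real-world details such as causal masking, multiple heads and layers, and kernel-level optimizations; it only shows the quadratic growth.

```python
def dense_attention_pairs(num_tokens: int) -> int:
    """Query-key comparisons in full dense attention: every token vs. every token."""
    return num_tokens * num_tokens


def growth_factor(base_tokens: int, multiplier: int) -> float:
    """How much the comparison count grows when the context is scaled by `multiplier`."""
    bigger = dense_attention_pairs(base_tokens * multiplier)
    return bigger / dense_attention_pairs(base_tokens)


if __name__ == "__main__":
    for tokens in (1_000, 2_000, 4_000):
        print(f"{tokens:>6} tokens -> {dense_attention_pairs(tokens):>12,} comparisons")
    print("Doubling the context multiplies the work by", growth_factor(1_000, 2))
```

Each doubling of the context quadruples the comparison count, so the gap between short and very long contexts widens quickly.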

Subquadratic argues that SubQ avoids much of that waste by using a sparse-attention architecture. Instead of computing every pairwise token interaction, the model dynamically chooses which interactions matter for the input it is processing.

Why Sparse Attention Matters

Sparse attention is not a new idea. Researchers have been trying for years to reduce the amount of attention computation without losing the model’s ability to understand long documents, codebases, and complex dependencies.

The hard part is selection. Fixed sparse patterns can miss important relationships because language is not predictable enough to decide in advance which tokens should interact. Subquadratic’s pitch is that SubQ makes those choices dynamically, allowing it to keep the useful signal while skipping a large share of the compute.
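The difference between a fixed pattern and a content-based one can be sketched in a few lines. This is a generic toy illustration, not SubQ's actual mechanism, which the company has not described here. All function names are hypothetical, and the top-k selection below still scores every key, so it shows only the selection idea; a practical system would need a cheaper way to pick candidates.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fixed_window_keys(query_pos, num_keys, window):
    """Fixed pattern: attend only to the `window` most recent positions."""
    start = max(0, query_pos - window + 1)
    return list(range(start, min(query_pos + 1, num_keys)))


def dynamic_topk_keys(query, keys, k):
    """Content-based pattern: attend to the k keys that score highest for this query."""
    scores = [dot(query, key) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_attention(query, keys, values, selected):
    """Scaled dot-product attention restricted to the `selected` key positions."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) / scale for i in selected])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, selected)) for d in range(dim)]


if __name__ == "__main__":
    # Toy data: the key at position 0 is highly relevant but far from the query at position 7.
    keys = [[5.0, 0.0]] + [[0.1, 0.1 * j] for j in range(1, 8)]
    values = [[float(i)] for i in range(8)]
    query = [1.0, 0.0]

    window = fixed_window_keys(query_pos=7, num_keys=8, window=3)
    topk = dynamic_topk_keys(query, keys, k=2)
    print("fixed window attends to:", window)
    print("dynamic top-k attends to:", topk)
    print("sparse output:", sparse_attention(query, keys, values, topk))
```

In this toy setup the fixed window never sees position 0, while the content-based selection does, which is the failure mode that dynamic selection is meant to avoid.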

If that holds up in broader testing, the impact could be significant: cheaper inference, faster long-context work, and lower energy use for tasks that currently punish dense-attention models.

What the Early Numbers Suggest

Subquadratic says SubQ can handle context windows up to 12 million tokens, far beyond the one-million-token range common among many top systems today. That would make it useful for workloads such as reviewing hundreds of documents, searching massive internal knowledge bases, or analyzing large software repositories.

Benchmark results that the company commissioned from an independent third party reportedly show several strong signals:

  • SubQ ran dramatically faster than models using older attention-optimization techniques in a baseline speed test.
  • It scored 89.7% on LiveCodeBench, placing it near frontier coding-model territory.
  • It maintained around 98% accuracy on long-context retrieval tests at six-million and 12-million-token scales.
  • The company claims that a large-context retrieval benchmark that can cost thousands of dollars to run on a premium frontier model costs only single-digit dollars on SubQ.

Those numbers are impressive, especially for developers watching the economics of long-context and agentic workloads. They also fit the broader trend covered in our piece on AI token costs forcing companies to rethink rollouts: model capability is only half the problem; cost per useful task matters just as much.

The Skepticism Is Still Reasonable

The caution is that benchmarks are not the same as broad availability. SubQ has not yet been widely opened for independent users to test across messy real-world workloads. Until that happens, it is hard to know whether the model is consistently strong or only highly optimized for a narrower set of coding and retrieval tasks.

There is also a technical wrinkle: SubQ was bootstrapped from weights derived from an existing open-source Qwen model rather than trained entirely from scratch. That does not invalidate the work, but it complicates the stronger claim that Subquadratic has fully reinvented the LLM architecture from the ground up.

So the practical read is cautious optimism. SubQ may not replace general-purpose frontier models across every category, but it could become important if it reliably delivers long-context reasoning and retrieval at a fraction of today’s cost.

What to Watch Next

The key milestone is access. If more developers and enterprise teams can test SubQ directly, the market will quickly learn whether its sparse-attention approach works outside controlled benchmarks.

For now, the most interesting part is not just the model itself. It is the direction of travel. The next wave of AI infrastructure may be less about making models bigger and more about making them computationally smarter. If Subquadratic is right, the transformer era may be entering its efficiency reckoning.