AGI by 2030? ARC-AGI V3, Agentic Intelligence, and the Shift Beyond LLMs
Y Combinator
March 27, 2026

🚀 The Big Picture

Frontier AI is entering a new regime. The most credible near-term path to broad automation runs through verifiable domains (code now, mathematics next), while genuine general intelligence will demand a new learning substrate—one that optimizes for sample efficiency, symbolic compression, and agentic exploration, not just more scale. The ARC-AGI benchmark series has quietly become the market’s best early-warning system for these step-changes in capability—and with the release of ARC-AGI V3, the focus shifts from passive pattern modeling to agentic intelligence.

“We’re probably looking at AGI 2030, early 2030s most likely—around the time ARC 6 or ARC 7 is released.”

Two AGIs: Automation vs. Intelligence

  • Economic AGI: Systems that can automate most economically valuable tasks. In verifiable domains, current stacks can already reach or exceed human level.
  • General Intelligence: Systems that approach any new problem with human-level skill acquisition efficiency—needing about as little data and compute as humans. This likely requires a different substrate than today’s LLM-centric stack.
“It’s already true that in principle current technology can fully automate at human level or beyond any domain where you have verifiable rewards. Code is the first domain to fall.”

Why Code Agents Exploded—and What Comes Next

The breakthrough with coding agents surprised even seasoned researchers. The unlock: verifiable reward signals. Unit tests and compilers provide ground truth, powering a post-training loop of generate → verify → fine-tune → repeat. Where such signals exist, fully automated systems are within reach.
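Sketched in code, that loop looks roughly like the following; the model, verifier, and fine-tune step here are hypothetical toy stand-ins, not any lab's actual pipeline:

```python
def generate(model, task):
    """Hypothetical stand-in: sample a candidate solution from the model."""
    return model(task)

def verify(task, candidate):
    """Ground-truth check (think unit tests or a compiler).
    Toy rule: the correct answer is the task value doubled."""
    return candidate == task * 2

def fine_tune(dataset, verified):
    """Stand-in for an actual gradient update: bank verified
    (task, solution) pairs as new training data."""
    dataset.extend(verified)

def post_training_loop(model, tasks, rounds=3):
    """generate -> verify -> fine-tune -> repeat."""
    dataset = []
    for _ in range(rounds):
        candidates = [(t, generate(model, t)) for t in tasks]
        verified = [(t, c) for t, c in candidates if verify(t, c)]  # keep only ground-truth wins
        fine_tune(dataset, verified)
    return dataset
```

The key property is that `verify` is dense and trustworthy: no human annotator sits in the loop, so the pipeline can run at machine scale.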

“Any domain where the solutions can be formally verified can be fully automated with current technology.”

Expect mathematics to follow for the same reason. By contrast, non-verifiable domains (e.g., essay writing) face slower progress because feedback relies on costly human annotations rather than dense, trustworthy signals.

LLMs Hit a Wall; A New Substrate Aims for Optimality

Years of scaling revealed a core limitation: gradient descent struggles to find generalizable algorithms even when models can represent them. The result is sophisticated pattern matching over tokens rather than compact causal programs.

Ndea’s approach targets that gap with program synthesis and a new engine that searches for symbolic models, as small as possible, that explain the data. Think “symbolic descent” (the symbolic analogue of gradient descent) guided by deep models. The aim: models that are far more data-efficient, smaller at inference, and more compositional, consistent with the minimum description length principle.

“Science is fundamentally a symbolic compression process… not curve fitting. We’re building the scientific method in algorithmic form.”
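A toy illustration of the minimum-description-length idea, assuming a made-up five-primitive DSL (not Ndea's actual engine): enumerate candidate programs and keep the one whose length plus residual error is smallest.

```python
# Toy DSL: programs are short Python expressions over an input x.
PRIMITIVES = ["x", "x + 1", "x * 2", "x * x", "x - 1"]

def description_length(program, data):
    """MDL score: program length plus the cost of residual errors.
    A mismatch must be 'explained' by storing the output verbatim,
    which we charge at a flat 10 units per example."""
    fit = sum(0 if eval(program, {"x": x}) == y else 10 for x, y in data)
    return len(program) + fit

def symbolic_search(data):
    """Pick the program minimizing description length: a crude stand-in
    for 'symbolic descent' over a space of candidate programs."""
    return min(PRIMITIVES, key=lambda p: description_length(p, data))
```

A compressive program ("x * 2", 5 characters) beats memorizing the observations, which is the sense in which explanation differs from curve fitting.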

🧭 ARC-AGI as a Market Barometer

Over the last half decade, ARC-AGI has mapped the field’s real capability shifts:

  • ARC V1: Base LLMs scored below 10%, with the original GPT-3 at zero, despite a roughly 50,000x scale-up. The dam broke only with reasoning models (e.g., OpenAI’s o1, then o3), creating a step-function jump. The o3 preview in December 2024 was expensive and lacked immediate product-market fit, but ARC signaled its significance.
  • ARC V2: Initially difficult, then rapidly saturated as labs built harnesses to generate new tasks, verify solutions, and fine-tune on successful chains. This verifiable RL-style loop mined the space at scale. A company from YC’s winter ’26 batch reportedly reached 97% using a cost-efficient harness approach.
“It’s not that the models are smarter. It’s that they’re suddenly more useful.”

Key lesson: post-training pipelines can deliver huge usefulness gains in specific domains without increasing fluid intelligence.

🕹️ ARC-AGI V3: Measuring Agentic Intelligence

V3 flips the script from passive modeling to interactive exploration. Agents drop into mini video games with no instructions, no stated goals, and unknown controls. They must actively explore, infer the rules, set goals, plan, and execute—then get scored on efficiency.

  • Human baseline: People solve these environments within a few hundred to a few thousand interactions; every action in V3 counts against the score.
  • Content: A bespoke studio built 250+ unique games, each typically taking ~10 minutes on first contact. No language, cultural symbols, or external knowledge; only core priors (objects, agents, basic physics).
  • Anti-targeting design: The private set differs significantly from the public set, and the public set is intentionally easier; overfitting to public won’t generalize.
“You’re being evaluated on games you’re seeing for the very first time… every action you spend exploring is counted toward your efficiency score.”
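A minimal sketch of the scoring idea, using an invented one-dimensional "game" rather than an actual V3 environment: the only feedback is a win signal, and every action taken counts against the agent.

```python
class UnknownGame:
    """Toy stand-in for a V3-style environment: no instructions,
    no stated goal; the agent must discover the win state by acting."""
    def __init__(self, goal, size=20):
        self.goal, self.size = goal, size
        self.pos, self.actions_used = 0, 0

    def step(self, action):
        """action is -1 or +1; returns the only signal: did we win?"""
        self.actions_used += 1  # every action counts toward the score
        self.pos = max(0, min(self.size - 1, self.pos + action))
        return self.pos == self.goal

def explore(env):
    """Naive exploration policy: sweep right until the win signal fires."""
    while not env.step(+1):
        pass
    return env.actions_used

def efficiency_score(actions_used, human_baseline):
    """Higher is better: fewer actions than the baseline scores above 1."""
    return human_baseline / actions_used
```

Note what is absent: no reward shaping, no task description, no pretraining on the environment. That first-contact constraint is exactly what distinguishes V3 from benchmarks a harness can mine.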

Cost, Scale, and the Shape of Future Systems

V3’s emphasis on efficiency aligns with a broader industry tension: useful automation today often comes from post-training harnesses and big spend; optimal intelligence tomorrow requires concise models and human-level learning efficiency.

  • On cost: A claim in the discussion suggested an ARC-like task could be handled for 0.3 cents using a symbolic engine versus $1 to $10 on a foundation model, spotlighting the efficiency gap these new substrates aim to close.
  • On architecture: Expect an “engine + knowledge” split: fluid intelligence as a compact core, with a larger knowledge base layered beneath.
“AGI will turn out to be a code base that’s less than 10,000 lines of code… probably on the order of megabytes. Had we known, it could have been built with 1980s compute.”
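If both cost figures hold as claimed, the implied gap is easy to work out:

```python
symbolic_cost = 0.003                 # $0.003 per task, i.e. 0.3 cents (claimed)
foundation_low, foundation_high = 1.00, 10.00  # $1-$10 per task (claimed)

gap_low = foundation_low / symbolic_cost       # ~333x
gap_high = foundation_high / symbolic_cost     # ~3,333x
print(f"Implied efficiency gap: {gap_low:.0f}x to {gap_high:.0f}x")
```

A three-orders-of-magnitude cost difference is what turns "technically automatable" into "economically automatable."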

🧪 Beyond the LLM Monoculture

Concentration risk is rising as “billions of dollars” flow into a single stack. The call: diversify research bets. Compute is a great equalizer—similar capital applied to alternative paradigms could surface new breakthroughs.

  • Near the stack: State space models; xLSTM; recurrent alternatives to transformers.
  • Below the stack: Replace gradient descent with search, evolution, or genetic algorithms.
  • Foundations: Abandon parametric curves; pursue symbolic program synthesis for optimality.
“If you had thrown the same investment into almost anything else, you’d also have seen extremely exciting results. Compute is the great equalizer.”
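As one example of optimizing without gradients, a bare-bones genetic algorithm can minimize a function using only selection and mutation; this is a toy sketch of the paradigm, not any production training system.

```python
import random

def genetic_minimize(fitness, bounds=(-10.0, 10.0), pop_size=30,
                     generations=60, mutation=0.3, seed=0):
    """Minimal genetic algorithm: truncation selection plus Gaussian
    mutation. No derivatives of `fitness` are ever computed."""
    rng = random.Random(seed)
    lo, hi = bounds
    pop = [rng.uniform(lo, hi) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)                    # rank; lower fitness = better
        parents = pop[: pop_size // 2]           # keep the best half
        children = [p + rng.gauss(0, mutation) for p in parents]  # mutate
        pop = parents + children                 # next generation
    return min(pop, key=fitness)
```

Because only fitness values are needed, the same loop works on discrete or non-differentiable search spaces, such as symbolic programs, where gradient descent does not apply.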

Inside Ndea: Building a Compounding Stack

Ndea began with a clear thesis: deep learning–guided program search over symbolic spaces, akin to AlphaZero’s principles. Multiple approaches were tried over roughly half a year before converging on reusable foundations that compound capability.

“You want a compounding stack. Don’t commit to the foundation too early, but make sure you’re building reusable layers.”

ARC-AGI Roadmap: V4 and V5

  • ARC 4: In the spirit of V3, but focused on continual and curriculum learning across longer time scales—fewer games, more levels, knowledge reuse across levels.
  • ARC 5: “Very new and different” and centered on invention.

When ARC can no longer measure a residual human–AI gap, that’s the AGI moment—and the benchmark baton passes to machines.

“Eventually we will run out of things to test… When it becomes impossible to measure the gap, this is the AGI moment.”

Open-Source Leverage: Lessons from Keras

Keras launched in March 2015 (exactly 11 years before this conversation) and grew by prioritizing usability end-to-end: simple APIs, excellent docs that teach the domain, and community building. A practical tip for maintainers enjoying sudden traction (one project cited had 40,000 stars and 100+ PRs within days):

“Hire your power users—hire your fans.”

Actionable Takeaways

  • Builders: Target verifiable domains; use harnesses to generate data, verify, and fine-tune at scale—but recognize this increases usefulness, not fluid intelligence.
  • Researchers: Optimize for sample efficiency and symbolic compression. Explore program synthesis, genetic algorithms, and search-driven training. Aim for systems that improve without human bottlenecks.
  • Strategists/Investors: Treat ARC as a leading indicator. V1 flagged reasoning; V2 captured agentic coding and verifiable loops; V3 will reveal whether agentic exploration is emerging. Don’t conflate post-training gains with general intelligence.
  • Careers: Deepen domain and AI literacy. The opportunity is to leverage the tools, not resist them.
“You’re not going to stop AI progress… The question is: how do you make use of it? How do you ride the wave?”

Odds, Ends, and Open Questions

  • Success probabilities for radical approaches can be low, around 10–15%, yet still worth pursuing if no one else will take the bet.
  • OpenAI Five’s restricted Dota 2 training leaned on tens of thousands of hours (and “maybe in millions” of steps) on the same environment—precisely what ARC V3 avoids by scoring first-contact exploration efficiency.
  • If 0.3 cents per task versus $1–$10 holds in practice for symbolic engines, entire economic frontiers open up in automation.

The center of gravity is shifting: from bigger models to better substrates, from static benchmarks to agentic exploration, and from usefulness via harnesses to learning efficiency that looks more like science than statistics.
