
Measuring Artificial Intelligence: ARC

The Next Frontier: Interactive Reasoning

Jun 9, 2025
https://www.youtube.com/watch?v=AT3Tfc3Um20

Why AI Benchmarking Needs a Human Touch

The field of AI benchmarking is at a turning point. While we’ve seen impressive demos — like Claude and Gemini playing Pokémon or OpenAI’s game-playing agents — these feats don’t necessarily prove true intelligence. Models often get stuck, require interventions, or rely on pre-existing training data rather than genuine problem-solving. So how do we measure real progress toward artificial general intelligence (AGI)?

At ARC, a nonprofit research organization, we believe the answer lies in benchmarking AI against the only proven example of general intelligence we have: humans. Our approach is simple but radical. Instead of testing AI on narrow, predefined tasks, we design challenges that are intuitive for humans but difficult for machines. This creates a measurable gap — highlighting where AI falls short and guiding researchers toward meaningful improvements.

What Is General Intelligence, Really?

Two definitions shape our thinking. John McCarthy, one of AI’s founding figures, argued that true intelligence means solving problems the system hasn’t seen or prepared for. Memorization isn’t enough — it’s about adaptability. François Chollet, another influential thinker, distilled intelligence into three words: skill acquisition efficiency. In other words, how quickly can an agent learn something new and apply it? Humans excel at this, and that’s the standard we should aim for.
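
As a rough sketch (an informal paraphrase, not the formal definition from Chollet's "On the Measure of Intelligence"), the idea can be written as a ratio:

```latex
% Informal paraphrase: intelligence as skill gained per unit of
% prior knowledge and experience spent acquiring it.
\[
  \text{intelligence} \;\propto\;
  \frac{\text{skill attained on unseen tasks}}
       {\text{priors} + \text{experience consumed}}
\]
```

The more skill an agent wrings out of less prior knowledge and less practice, the more intelligent it is by this measure; memorizing millions of examples scores poorly.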

Our first benchmark, ARC v1, tested this idea with grid-based reasoning tasks. Participants saw an input-output transformation and had to deduce the underlying rule. The key? Every task was novel — no repeated patterns, no memorization. We even validated the benchmark by testing over 400 humans in person, ensuring every challenge was solvable by people.
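
For a concrete sense of what that looks like: ARC v1 tasks are published as small JSON files of integer grids, each with a handful of training pairs and one or more test inputs. A minimal Python sketch of loading a task and checking a candidate rule against its training pairs might look like this (the filename and the flip_horizontal rule are illustrative, not taken from a real task):

```python
import json

def load_task(path):
    """Load an ARC-style task: JSON with 'train' and 'test' lists of
    {'input': grid, 'output': grid} pairs, where a grid is a list of
    rows of small integers (colors)."""
    with open(path) as f:
        return json.load(f)

def flip_horizontal(grid):
    """Illustrative candidate rule: mirror each row left-to-right."""
    return [list(reversed(row)) for row in grid]

def rule_explains_task(task, rule):
    """A rule 'solves' a task if it maps every training input to its output."""
    return all(rule(pair["input"]) == pair["output"] for pair in task["train"])

# Hypothetical usage; "example_task.json" is a placeholder filename.
# task = load_task("example_task.json")
# if rule_explains_task(task, flip_horizontal):
#     predictions = [flip_horizontal(pair["input"]) for pair in task["test"]]
```

The point of the benchmark is that no single hand-written rule like this carries over between tasks; each task demands inferring a fresh rule from a few examples.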

The Next Frontier: Interactive Reasoning

But static tests only go so far. Human intelligence isn’t just about solving puzzles in isolation — it’s about exploring, experimenting, and adapting in open-ended environments. That’s why our next benchmark, ARC v3, shifts to interactive reasoning through games.

Games are perfect for this. They combine complex rules, clear objectives, and the need for exploration — all without requiring language or cultural knowledge. Unlike past benchmarks (like Atari), where developers could tweak models based on known games, ARC v3 introduces a private evaluation set. Neither the AI nor its creators will have seen these games beforehand, ensuring genuine generalization.

How It Works

Imagine being dropped into a game called Locksmith with no instructions. You’d have to explore, interact with objects, and piece together the rules yourself. That’s exactly what we’ll ask AI to do. No hints, no pre-training — just pure problem-solving.
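
The real interface hasn't been published yet, but conceptually it is the classic agent-environment loop: observe, act, observe again, and count the actions taken. Here is a minimal Python sketch of that loop; everything in it (the UnknownGame class, its action names, the toy win condition) is hypothetical, not the actual ARC v3 engine.

```python
import random

class UnknownGame:
    """Stand-in for an ARC v3-style environment: the agent sees raw
    observations and a small action set, but no rules or instructions."""
    def __init__(self):
        self._state = 0
        self.actions = ["up", "down", "left", "right", "interact"]

    def observe(self):
        return {"frame": self._state}           # opaque observation

    def step(self, action):
        self._state += 1                        # toy dynamics
        solved = self._state >= 10              # toy win condition
        return self.observe(), solved

def explore(game, max_actions=1000):
    """No hints, no pre-training: act, observe, and count how many
    actions it takes to reach the goal."""
    for n in range(1, max_actions + 1):
        action = random.choice(game.actions)    # a real agent would plan here
        _, solved = game.step(action)
        if solved:
            return n                            # actions used to solve
    return None                                 # never solved

print(explore(UnknownGame()))
```

A real agent would replace the random choice with exploration, hypothesis-testing, and planning; the benchmark is about how efficiently it can do that in a game it has never seen.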

To measure success, we’ll compare AI performance to human baselines. How many actions does it take to solve a game? How quickly can the agent adapt? If AI can’t match human performance on these novel challenges, we can confidently say it hasn’t reached human-like intelligence.
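
One simple way to express that comparison (an illustrative metric, not a published ARC scoring rule) is an action-efficiency ratio against the human baseline:

```python
def action_efficiency(agent_actions, human_baseline_actions):
    """Ratio of human to agent action counts on the same game:
    1.0 means human-level efficiency, above 1.0 means the agent
    solved it in fewer actions than the typical human."""
    if agent_actions is None:        # agent never solved the game
        return 0.0
    return human_baseline_actions / agent_actions

# Illustrative numbers only.
print(action_efficiency(agent_actions=240, human_baseline_actions=120))  # 0.5
```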

What’s Next

We’re rolling out a preview soon, starting with five games at the World’s Fair in San Francisco, followed by a full release of 120+ games by 2026. But building this isn’t easy — we’re developing a lightweight Python engine and looking for game designers, adversarial testers, and supporters to help shape the future of AGI benchmarking.

The goal isn’t just to make AI better at games. It’s to create a true test of intelligence — one that pushes us closer to machines that think, learn, and explore like humans. And if we succeed, the next generation of AI won’t just play Pokémon — it’ll understand the world.


Written by noailabs

Tech/biz consulting, analytics, research for founders, startups, corps and govs.
