
What Humans Do Without Thinking

A new AI benchmark just revealed something remarkable. Not about machines, but about us.

Anders Granberg

Co-Founder

Last week, the ARC Prize Foundation released ARC-AGI-3, the latest version of its benchmark for measuring machine intelligence. The format has changed. Instead of static puzzles, AI agents are dropped into interactive environments: hundreds of original turn-based games with no instructions, no rules, and no stated goals. The agent has to explore, figure out what's going on, discover what winning looks like, and adapt its strategy as the levels get harder [1].
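
To make the shift from static puzzles concrete, here is a minimal sketch of the loop an agent has to run in this kind of benchmark. The environment interface used here (`reset`, `legal_actions`, `step`) is a hypothetical stand-in, not the real ARC-AGI-3 API; the point is only the structure of the task: the agent starts with no rules, no goal, and nothing but trial and feedback.

```python
import random
from collections import defaultdict

def play(env, max_turns=1000):
    """Explore a turn-based game with no instructions: prefer the
    least-tried action in the current state (a crude curiosity
    heuristic) and keep going until the environment signals it's done."""
    tried = defaultdict(int)           # (observation, action) -> attempts
    obs = env.reset()                  # assumed: returns a hashable observation
    for _ in range(max_turns):
        actions = env.legal_actions()  # assumed: non-empty list of inputs
        fewest = min(tried[(obs, a)] for a in actions)
        action = random.choice(
            [a for a in actions if tried[(obs, a)] == fewest]
        )
        tried[(obs, action)] += 1
        obs, done = env.step(action)   # assumed: advances one turn
        if done:                       # the agent only learns what "winning"
            return True                # looks like by stumbling into it
    return False
```

Even this naive explorer captures the shape of the task. What it lacks is the human part: forming a theory of what the game wants, testing it, and revising it.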

The benchmark was designed around how humans perform on their first attempt. Every environment was selected to be solvable by people on first contact, without training or preparation. At launch, the best AI systems score below 1% [1].

That gap is striking. But this isn't really a story about what AI can't do. It's a story about something we do so effortlessly that we've stopped noticing it.

The ability we take for granted

Think about what the humans in those tests actually did. They were placed in an unfamiliar environment with no explanation. They looked around. They tried things. They noticed patterns. They formed a theory about what the game wanted, tested it, adjusted, and moved forward. They did this reliably, across thousands of sessions [2].

There is no clean word for this ability. "Intuition" gets close but sounds vague. "Reasoning" is too narrow, since it implies logic when what's actually happening involves perception, pattern recognition, risk tolerance, and something closer to curiosity.

If you build products or lead teams, you already know this ability matters. It's the thing that lets someone walk into a new project, a new market, or a broken process and start making useful decisions before they have complete information. And even after years of rapid progress, current AI systems still struggle badly when the task is truly unfamiliar.

What the benchmark actually measures

François Chollet, the researcher behind ARC, has argued for years that the industry measures AI progress the wrong way. Most benchmarks reward the ability to recall and recombine information the model has already seen. They test memory disguised as reasoning. ARC-AGI-3 tests something different: the ability to learn in real time, from experience, inside an environment the system has never seen before [3].

That distinction maps directly to the kinds of decisions product and technology teams face every day. Some work is pattern-based: categorising, translating, summarising, generating variations of known formats. AI is already good at this, and getting better fast. Other work requires meeting something genuinely new and making sense of it without a guide. A user problem that doesn't match any existing template. A technical decision where the data points in two different directions. A product bet where the right answer depends on context nobody has written down.

The benchmark measures what most builders already feel: these are fundamentally different kinds of work, and strength in the first does not automatically transfer to the second.

The automation reflex

When organisations adopt AI, the instinct is to start with the question: "What can we automate?" It's a reasonable question. But if it's the only question, you risk designing out the very work that humans are uniquely good at. Not deliberately, but because that kind of capability is hard to see on a spreadsheet. Nobody tracks "new situations handled well" as a KPI.

This is something we think about constantly at Strife. We're building a CMS platform with AI deeply integrated into the product, and into how we work as a team. Every week we face the same question: where does AI make the product better for our users, and where does it risk removing the judgment and creativity that makes the work meaningful? The answer is never obvious, and it keeps changing as the technology moves.

What we've learned is that the line between "automate this" and "protect this" isn't something you draw once. It's a design decision you revisit continuously, and it requires understanding the people on both sides: the users and the team.

The ARC-AGI-3 results suggest a useful starting point for that conversation.

Instead of asking what can be automated, ask where in your product or organisation the work is genuinely unpredictable. Where do people regularly encounter situations they haven't seen before and figure them out anyway? Where does judgment matter more than process?

Those are the places where human capability is essential. And they are the places most likely to be weakened when the default strategy is to add AI to everything.

Designing for the human layer

This isn't an argument against AI. The pattern-based work that AI handles well is real work, and offloading it creates genuine value. The argument is that the value of AI depends on understanding where it fits, and that understanding starts with taking human capability seriously.

The ARC benchmarks keep demonstrating the same pattern. Earlier versions saw rapid AI progress: scores on ARC-AGI-1 climbed from near zero to over 50% through test-time training techniques, and ARC-AGI-2 was won at 24% accuracy in its first competition year [4]. Then ARC-AGI-3 arrives and the state of the art drops back below 1%. Every time the benchmark gets closer to measuring genuine adaptability, the gap between human and machine reappears.

That pattern is worth paying attention to when making product and technology decisions. The strongest AI strategy isn't one that automates the most. It's one that is precise about what to automate and what to protect. The best digital products aren't the ones with the most AI features. They're the ones that use AI to strengthen what humans already do well, while keeping human judgment involved where it matters.

The next time someone presents an AI roadmap for your product, it might be worth asking: which parts of this plan are designed around what the technology can do, and which parts around what the people (your users and your team) can do?

The answer will tell you more about the strategy than any benchmark.


References

[1] ARC Prize Foundation, "Announcing ARC-AGI-3", March 25, 2026. arcprize.org

[2] ARC Prize Foundation, Human player data from 1,200+ players across 3,900+ games during the ARC-AGI-3 preview period. arcprize.org

[3] Sullivan, Mark, "This new benchmark could expose AI's biggest weakness", Fast Company, March 25, 2026. fastcompany.com

[4] ARC Prize Foundation, "ARC Prize 2025 Results and Analysis". arcprize.org
