AI Keeps Passing Our Benchmarks. Our Solution Is to Move the Goal Post.

"We've run out of test questions to ask. Reality is the ultimate reasoning test."

– Elon Musk at the Grok 4 launch, July 9, 2025

We take a moment to stop and reflect when people reach new educational milestones. High school graduation, a college diploma, a PhD thesis: these are major events in a person's life. Many assumed that as artificial intelligence (AI) progressed, its milestones would be given similar weight. That does not seem to be the case. When the latest AI model passes a benchmark, we barely have time to evaluate how it did so before it passes the next one. We give ourselves no time to consider the implications of these accomplishments and what they mean in the broader scope of humanity. Instead of pumping the brakes to re-evaluate, we appear to be pushing the gas pedal all the way to the floor.

The pursuit of artificial general intelligence (AGI, machines that can reason and adapt like humans) has long been measured by benchmarks that evolve as AI advances, effectively moving the goalposts when old tests are conquered.

Early precursors set the stage: The Turing Test, proposed by Alan Turing in 1950, challenged machines to engage in text-based conversations indistinguishable from humans, encompassing natural language, deception, and context. Once deemed the gold standard, it was arguably first passed in 2014 by the Eugene Goostman chatbot, which fooled 33% of judges, though critics dismissed it as superficial. Modern LLMs like GPT-4 have since far exceeded this, rendering it obsolete.

As limitations in conversational mimicry became clear, benchmarks shifted toward deeper reasoning. The Winograd Schema Challenge tested commonsense via pronoun resolution (for example, deciding what "it" refers to in "The trophy doesn't fit in the suitcase because it is too big"), and was passed at human levels (~90%) by models like DeBERTa around 2021. Similarly, GLUE (2018) and SuperGLUE (2019) aggregated language tasks, with AI surpassing human baselines (87-89%) by 2019-2020 via BERT and T5. These paved the way for today's frontier tests like the American Invitational Mathematics Examination (AIME), the Abstraction and Reasoning Corpus for AGI (ARC-AGI), and Humanity's Last Exam (HLE), which probe advanced math, abstraction, and multidisciplinary expertise.

AIME, launched in 1983 as a U.S. high school math contest, features 15 integer-answer problems in algebra, geometry, and beyond; it was adopted in the 2020s as an AI benchmark for rigorous reasoning (top human scores hover at 50-60%). Early AI like GPT-3 scored <20% in 2020-2021, with GPT-4 hitting ~52% in 2023 and Grok 4 achieving a perfect 100% in 2025.

ARC-AGI (2019) uses grid puzzles to evaluate generalization and core intelligence priors, with humans at ~85%; initial AI scores were 5-10% in 2019-2020, climbing to o1-preview's 13% in 2024 and o3's ~88% on variants in 2025, though the full test remains challenging.

Humanity's Last Exam (developed late 2024, finalized early 2025) poses 3,000+ PhD-level, multi-modal questions across fields like physics and ethics, designed to resist quick saturation (human experts score 60-70%, while o3 managed 24% and Grok 4 44% in 2025 evaluations). This progression from mimicry to profound problem-solving illustrates how benchmarks have continually adapted, as AI's feats push us toward real-world validation.

Timeline of AI Benchmark Development

1950: Turing Test proposed as the imitation game for conversational AI.

1983: AIME established as a math competition, later adapted for AI benchmarking.

2011: Winograd Schema Challenge formalized, building on 1970s ideas for commonsense testing.

2014: Turing Test arguably first passed by Eugene Goostman.

2018-2019: GLUE and SuperGLUE introduced for broad language evaluation, surpassed by AI within 1-2 years.

2019: ARC-AGI launched to assess abstraction and reasoning.

2020-2023: Modest AI gains on AIME and ARC-AGI; precursors like SuperGLUE fully passed.

Late 2024: Humanity's Last Exam development begins amid benchmark saturation concerns.

Early 2025: HLE finalized; major breakthroughs on ARC-AGI variants and AIME perfection by frontier models.

At Grok 4’s launch on July 9th, it was announced that the latest model had made even more progress than expected on our newest benchmarks. As we accelerate full speed ahead toward AGI, we are running out of conventional exams to test these models’ limits. They will soon be tested on their ability to operate and reason in real-world environments. As with human achievements, it may behoove us to slow down and recognize these milestones when they happen. If we take a moment to acknowledge these technological leaps, we might foster a more thoughtful path to AGI, one that celebrates progress without sacrificing our shared future.
