📊 Full opportunity report: Every Benchmark Launched 2023-2024 Has Fallen — The METR / SWE-Bench / CORE-Bench / MLE-Bench / PostTrainBench Sequence on ThorstenMeyerAI.com — validation score, market gap, and execution plan.

TL;DR

Six key AI benchmarks launched between 2023 and 2024 have all saturated, or are close to saturating, within 15 to 30 months of launch. The pattern suggests rapid, structural progress in AI research capabilities, on a faster trajectory than many forecasts predicted.

All six major benchmarks designed to measure AI research and development capability and launched between 2023 and 2024 have saturated or are approaching saturation within 15 to 30 months of launch, according to recent analysis by Thorsten Meyer. The pattern points to a rapid acceleration in AI capabilities and suggests that progress is occurring faster than many forecasts predicted.

Thorsten Meyer’s analysis, based on data from Jack Clark’s Import AI #455, highlights six benchmarks: SWE-Bench, METR Time Horizons, CORE-Bench, MLE-Bench, PostTrainBench, and CPU Speedup. Each was explicitly designed to challenge AI systems across different facets of research and engineering.

As of May 2026, all six benchmarks have been declared solved or saturated, or are tracking toward saturation, within 15 to 30 months of launch. For example, SWE-Bench, which measures real-world software engineering tasks, rose from 2% to 93.9% in 30 months, and its authors have declared it ‘saturated.’ Similarly, METR Time Horizons, which tracks how long a research task AI systems can carry out, expanded from 30 seconds to 12 hours in four years, a 1,440-fold increase.
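The arithmetic behind those two figures can be checked in a few lines. The sketch below uses only the numbers quoted above; the doubling time is not a figure from the analysis but a derived value that assumes growth was smoothly exponential over the whole four years, and the function names are illustrative.

```python
import math

def fold_growth(start, end):
    """How many times larger `end` is than `start`."""
    return end / start

def doubling_time_months(start, end, span_months):
    """Implied doubling time, ASSUMING constant exponential growth."""
    doublings = math.log2(fold_growth(start, end))
    return span_months / doublings

# METR Time Horizons, as quoted: 30 seconds to 12 hours in four years.
start_seconds = 30
end_seconds = 12 * 60 * 60
metr_fold = fold_growth(start_seconds, end_seconds)
metr_doubling = doubling_time_months(start_seconds, end_seconds, 4 * 12)

# SWE-Bench, as quoted: 2% to 93.9% in 30 months.
# A simple average in percentage points per month, not a fitted curve.
swe_points_per_month = (93.9 - 2.0) / 30

print(f"METR fold growth: {metr_fold:,.0f}x")
print(f"Implied doubling time: {metr_doubling:.1f} months")
print(f"SWE-Bench average gain: {swe_points_per_month:.1f} points per month")
```

Under that constant-growth assumption, the 1,440-fold increase works out to roughly ten and a half doublings, or one about every four and a half months.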

Other benchmarks, such as CORE-Bench and MLE-Bench, also came close to saturation within 15 to 16 months, and the authors of some have explicitly declared them ‘solved.’ The consistent pattern across all six indicates a structural trend rather than isolated improvements, with progress happening on a timeline of months rather than years.

Implications of Rapid Benchmark Saturation for AI Development

The saturation of these benchmarks within such short timeframes strongly suggests that AI research capabilities are advancing at an exponential pace. This pattern supports forecasts like Jack Clark’s 60% automation of AI R&D by 2028, as the benchmarks measure core skills needed for automating research and engineering tasks. It indicates that AI systems are rapidly approaching or surpassing human-level performance in key areas, which could accelerate deployment, innovation, and possibly reshape AI research workflows.

For policymakers, investors, and industry leaders, these developments imply that AI progress is not only faster than previously thought but also reaching a point where further improvements on these particular benchmarks may be incremental rather than foundational, because little headroom remains. This could influence strategic planning, regulation, and investment decisions, and it makes monitoring these benchmarks and their implications for AI capabilities more important.

Background on Benchmark Design and Previous Expectations

These six benchmarks were selected explicitly to challenge AI systems across different domains, including software engineering, research reproduction, machine learning engineering, and compute optimization. Launched between 2023 and 2024, they were intended to measure the progress of AI in performing complex, real-world tasks that are critical for autonomous research and development.

Prior to this saturation pattern, forecasts of AI capability growth varied, with some experts expecting gradual improvements over several years. The pattern observed in these benchmarks, however, reveals a much faster trajectory, driven by advances in model architectures, training techniques, and compute efficiency. Jack Clark’s analysis underscores that each benchmark’s rapid saturation is unlikely to be coincidental, representing a structural shift rather than isolated gains.

“The pattern across all six benchmarks indicates a rapid, structural acceleration in AI research capabilities, with each reaching or nearing saturation within months.”

— Thorsten Meyer

Uncertainties About Long-Term Impact and Future Trajectories

While the saturation of these benchmarks indicates rapid progress, it remains unclear how this will translate into broader AI deployment, real-world applications, or further breakthroughs. It is also uncertain whether new benchmarks will emerge that challenge AI systems at higher levels or if current saturation points will hold as models continue to evolve.

Additionally, some experts caution that saturation on benchmarks may reflect overfitting or measurement noise, though the consistent pattern across six diverse tests suggests this is less likely. The long-term impact on AI safety, regulation, and societal integration remains to be seen, as does the potential for diminishing returns beyond current saturation points.
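To give a rough sense of what measurement noise means for a single benchmark score, the sketch below computes a normal-approximation interval around a pass rate. The 93.9% figure is the SWE-Bench score quoted earlier; the task count is a hypothetical placeholder, not a number from the analysis, and the normal approximation is itself a simplification that gets less reliable as scores approach 100%.

```python
import math

def pass_rate_interval(pass_rate, n_tasks, z=1.96):
    """Approximate 95% interval for a pass rate measured on n_tasks tasks.

    Treats each task as an independent pass/fail trial, which is an
    assumption; real benchmark tasks may be correlated.
    """
    standard_error = math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)
    low = max(0.0, pass_rate - z * standard_error)
    high = min(1.0, pass_rate + z * standard_error)
    return low, high

# 0.939 is the SWE-Bench score quoted in the article.
# N_TASKS is a HYPOTHETICAL task count chosen for illustration only.
N_TASKS = 500
low, high = pass_rate_interval(0.939, N_TASKS)
print(f"Pass rate 93.9% on {N_TASKS} tasks: roughly {low:.1%} to {high:.1%}")
```

With a few hundred tasks, sampling noise of this kind moves a score by a couple of percentage points at most, far less than the gains reported across the six benchmarks. It says nothing about overfitting, which a calculation like this cannot detect.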

Next Steps for Monitoring AI Capability Progress

Researchers and industry stakeholders will need to track the development of new benchmarks and evaluate whether current saturation levels persist as models evolve. Attention should also focus on how these rapid capability gains translate into practical AI deployment, including assessing risks, safety, and regulatory implications.

Further studies are expected to analyze whether these saturation patterns continue across other domains and whether new challenges emerge that can push AI capabilities beyond current limits. Policymakers and investors should prepare for a landscape where AI systems are increasingly capable of autonomous research and engineering tasks, potentially accelerating innovation cycles.

Key Questions

What does the saturation of these benchmarks mean for AI safety?

Saturation indicates rapid capability growth, which could lead to AI systems performing at or above human levels in key tasks. This raises questions about safety, control, and alignment, requiring ongoing monitoring and regulation.

Are these benchmarks representative of real-world AI applications?

Yes, many benchmarks are designed to simulate or directly measure real-world tasks, such as software engineering and research reproduction, making their saturation a strong indicator of practical AI capabilities.

Will new benchmarks emerge to challenge AI systems further?

It is likely that researchers will develop more complex benchmarks to push AI beyond current saturation levels, especially as models continue to improve rapidly.

How soon could AI systems automate most R&D tasks?

Based on current saturation trends, experts like Jack Clark forecast around 60% automation of AI R&D by 2028, but this depends on continued progress and new challenge benchmarks.

What are the risks of rapid AI capability saturation?

Rapid saturation could lead to deployment of highly capable AI systems without sufficient safety measures, raising concerns about misuse, unintended consequences, and regulatory gaps.

Source: ThorstenMeyerAI.com
