Every Benchmark Launched 2023-2024 Has Fallen — The METR / SWE-Bench / CORE-Bench / MLE-Bench / PostTrainBench Sequence


TL;DR

Six key AI benchmarks launched between 2023 and 2024 have all either saturated or are approaching saturation, each within months of launch. The pattern suggests rapid, structural progress in AI research capabilities, on a faster trajectory than many forecasts anticipated.

All six major benchmarks that were designed to measure AI research and development capability and launched between 2023 and 2024 have now saturated or are approaching saturation, according to recent analysis by Thorsten Meyer. He argues that this pattern indicates a rapid acceleration in AI capabilities, with progress occurring faster than many forecasts predicted.

Thorsten Meyer’s analysis, based on data from Jack Clark’s Import AI #455, highlights six benchmarks: SWE-Bench, METR Time Horizons, CORE-Bench, MLE-Bench, PostTrainBench, and CPU Speedup. Each was explicitly designed to challenge AI systems across different facets of research and engineering.

As of May 2026, all six benchmarks have been declared solved, been declared saturated, or are tracking toward saturation within 15 to 30 months of launch. For example, SWE-Bench, which measures real-world software engineering tasks, rose from 2% to 93.9% in 30 months, with the authors declaring it ‘saturated.’ Similarly, METR Time Horizons, which tracks AI’s ability to perform research tasks of increasing duration, expanded from 30 seconds to 12 hours in four years, a 1,440-fold increase.
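The arithmetic behind these figures can be checked with a short sketch. The helper names below are hypothetical, "four years" is taken as 48 months, and the doubling time assumes steady exponential growth over the whole period. That doubling time is an implication of the quoted numbers, not a figure reported in the analysis.

```python
import math


def fold_growth(start: float, end: float) -> float:
    """Ratio of the final value to the initial value."""
    return end / start


def doubling_time_months(start: float, end: float, elapsed_months: float) -> float:
    """Months per doubling, assuming constant exponential growth (an assumption)."""
    doublings = math.log2(end / start)
    return elapsed_months / doublings


# METR Time Horizons: 30 seconds to 12 hours, "four years" read as 48 months.
metr_start_s = 30
metr_end_s = 12 * 60 * 60  # 12 hours expressed in seconds
metr_fold = fold_growth(metr_start_s, metr_end_s)
metr_doubling = doubling_time_months(metr_start_s, metr_end_s, 48)

# SWE-Bench: 2% to 93.9% over 30 months, as average percentage points per month.
swe_points_per_month = (93.9 - 2.0) / 30

print(f"METR fold growth: {metr_fold:,.0f}x")
print(f"METR implied doubling time: {metr_doubling:.2f} months")
print(f"SWE-Bench average gain: {swe_points_per_month:.2f} points/month")
```

The fold-growth calculation reproduces the 1,440-fold figure quoted above. The average monthly gain for SWE-Bench is a simple linear average and says nothing about how the improvement was distributed across the 30 months.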

Other benchmarks, such as CORE-Bench and MLE-Bench, also reached near-complete saturation within 15-16 months, with the authors explicitly declaring some as ‘solved.’ The consistent pattern across all six indicates a structural trend rather than isolated improvements, with progress happening on a timeline of months rather than years.

Implications of Rapid Benchmark Saturation for AI Development

The saturation of these benchmarks within such short timeframes suggests that AI research capabilities are advancing at an exponential pace. The pattern is consistent with forecasts such as Jack Clark’s projection of 60% automation of AI R&D by 2028, because the benchmarks measure core skills needed to automate research and engineering tasks. It indicates that AI systems are rapidly approaching or surpassing human-level performance in key areas, which could accelerate deployment and innovation and may reshape AI research workflows.

For policymakers, investors, and industry leaders, these developments imply that AI progress is not only faster than previously thought but also reaching a point where further improvements on these particular benchmarks may be incremental rather than foundational. This could influence strategic planning, regulation, and investment decisions, and it underlines the importance of monitoring these benchmarks and what they imply about AI capabilities.

Background on Benchmark Design and Previous Expectations

These six benchmarks were selected explicitly to challenge AI systems across different domains, including software engineering, research reproduction, machine learning engineering, and compute optimization. Launched between 2023 and 2024, they were intended to measure AI progress on complex, real-world tasks that are critical for autonomous research and development.

Prior to this saturation pattern, forecasts of AI capability growth varied, with some experts expecting gradual improvements over several years. The pattern observed in these benchmarks, however, reveals a much faster trajectory, driven by advances in model architectures, training techniques, and compute efficiency. Jack Clark’s analysis underscores that each benchmark’s rapid saturation is unlikely to be coincidental, representing a structural shift rather than isolated gains.

“The pattern across all six benchmarks indicates a rapid, structural acceleration in AI research capabilities, with each reaching or nearing saturation within months.”
— Thorsten Meyer

Uncertainties About Long-Term Impact and Future Trajectories

While the saturation of these benchmarks indicates rapid progress, it remains unclear how this will translate into broader AI deployment, real-world applications, or further breakthroughs. It is also uncertain whether new benchmarks will emerge that challenge AI systems at higher levels or if current saturation points will hold as models continue to evolve.

Additionally, some experts caution that saturation on benchmarks may reflect overfitting or measurement noise, though the consistent pattern across six diverse tests suggests this is less likely. The long-term impact on AI safety, regulation, and societal integration remains to be seen, as does the potential for diminishing returns beyond current saturation points.

Next Steps for Monitoring AI Capability Progress

Researchers and industry stakeholders will need to track the development of new benchmarks and evaluate whether current saturation levels persist as models evolve. Attention should also focus on how these rapid capability gains translate into practical AI deployment, including assessing risks, safety, and regulatory implications.

Further studies are expected to analyze whether these saturation patterns continue across other domains and whether new challenges emerge that can push AI capabilities beyond current limits. Policymakers and investors should prepare for a landscape where AI systems are increasingly capable of autonomous research and engineering tasks, potentially accelerating innovation cycles.

Key Questions

What does the saturation of these benchmarks mean for AI safety?

Saturation indicates rapid capability growth, which could lead to AI systems performing at or above human levels in key tasks. This raises questions about safety, control, and alignment, requiring ongoing monitoring and regulation.

Are these benchmarks representative of real-world AI applications?

Yes, many benchmarks are designed to simulate or directly measure real-world tasks, such as software engineering and research reproduction, making their saturation a strong indicator of practical AI capabilities.

Will new benchmarks emerge to challenge AI systems further?

It is likely that researchers will develop more complex benchmarks to push AI beyond current saturation levels, especially as models continue to improve rapidly.

How soon could AI systems automate most R&D tasks?

Based on current saturation trends, experts like Jack Clark forecast around 60% automation of AI R&D by 2028, but this depends on continued progress and new challenge benchmarks.

What are the risks of rapid AI capability saturation?

Rapid saturation could lead to deployment of highly capable AI systems without sufficient safety measures, raising concerns about misuse, unintended consequences, and regulatory gaps.

Source: ThorstenMeyerAI.com