Ai Benchmarks for Code

AI Coding Agents Write 180% More Code But Ship Only 30% More Software

AI coding agents boost code output by 180% but shipping rises only 30%, MIT finds. Why private data access beats benchmark ...

Gomboc AI Publishes First Open Benchmark for AI Code Remediation

15 cloud scenarios. 43 merge-ready fixes. 100% loop closure. 12 minutes and $17 to author once; seconds and zero-cost ...

SD Times

Beyond Benchmarks: Measuring the True Cost of AI-Generated Code

Value stream management involves people in the organization to examine workflows and other processes to ensure they are deriving the maximum value from their efforts while eliminating waste — of ...

21h

Xiaomi’s New AI Coding Agent Beats Claude Code and is Now Completely Free to Use With a Major Advantage Over Rivals

Xiaomi has open-sourced MiMo Code V0.1.0, a new terminal-based AI coding assistant built for long-running software projects.

15d

DeepSWE blows up the AI coding leaderboard, crowns GPT-5.5, and finds Claude Opus exploiting a benchmark loophole

DeepSWE puts GPT-5.5 atop the AI coding leaderboard while raising new questions about Claude Opus, SWE-Bench Pro, and ...

22h

How Anthropic’s Fable 5 Beat ChatGPT 5.5 by 20% in Coding Benchmarks

Anthropic has launched Claude Fable 5, a Mythos-class AI model that outperforms GPT 5.5 in coding and vision tasks despite ...

Hosted on MSN

What AI coding benchmarks still miss about software quality

Most AI coding benchmarks still ask the question: did the agent produce code that passes the current tests? This is a useful question, but it is too narrow. Software development is iterative.

16h

Datadog evolves from observability platform into control plane for AI systems: Benchmark

Datadog's transformation from an observability platform into a control network for AI-driven systems and autonomous ...

InfoWorld

Why benchmarks are key to AI progress

Researchers are racing to develop more challenging, interpretable, and fair assessments of AI models that reflect real-world use cases. The stakes are high. Benchmarks are often reduced to leaderboard ...

Forbes

AI Models Still Struggle With Reasoning — And Here’s Why

Forbes contributors publish independent expert analyses and insights. I write about the economics of AI. What looks like intelligence in AI models may just be memorization. A closer look at benchmarks ...

28d

Microsoft’s multi-agent AI system tops Anthropic’s Mythos on cybersecurity benchmark

Microsoft's new vulnerability-scanning system, codenamed MDASH, scored 88.45% on the CyberGym benchmark, surpassing ...

Some results have been hidden because they may be inaccessible to you

Show inaccessible results