Every major AI release comes with bold benchmark claims — "state of the art," "surpasses GPT-4," "human-level reasoning." But how reliable are these numbers? This page tracks 17 community-assessed claims about model performance, benchmark scores, and capability evaluations, each backed by traceable source links.
From MMLU and HumanEval to custom coding and math benchmarks, every claim is scored on attribution accuracy (did the company really claim these numbers?) and veracity (do independent evaluations support the claim?). Inspect the evidence chain and decide for yourself.

AI: Google announced Gemini 1.5 with a long-context capability in 2024
Evidence suggests supported
Attribution: 94% | Veracity: 81%
1 evidence entry

AI: OpenAI released GPT-5.4 across ChatGPT, the API, and Codex
Evidence suggests supported
Attribution: 96% | Veracity: 82%
1 evidence entry

AI: Current deepfake detection tools fail to identify AI-generated content more than 30% of the time
Evidence inconclusive
Attribution: 60% | Veracity: 55%
1 evidence entry

Technology: AI coding assistants now generate approximately 40% of new code at some major tech companies
Evidence suggests supported
Attribution: 85% | Veracity: 75%
1 evidence entry

AI: Open-source AI models have closed the capability gap with proprietary models to within 5%
Evidence mixed
Attribution: 90% | Veracity: 50%
2 evidence entries

Technology: Google claimed its Willow quantum chip solved a computation in under 5 minutes that would take classical supercomputers 10 septillion years
Partially supported
Attribution: 95% | Veracity: 65%
2 evidence entries

Technology: SpaceX achieved the first successful controlled reentry and splashdown of Starship's upper stage in June 2024
Strongly supported
Attribution: 95% | Veracity: 85%
2 evidence entries

AI: Anthropic published the first Responsible Scaling Policy committing to capability evaluations before training more powerful AI
Partially supported
Attribution: 95% | Veracity: 65%
2 evidence entries

AI: Major AI models from OpenAI, Anthropic, and Google are becoming less capable while prices increase
Evidence inconclusive
Attribution: 68% | Veracity: 45%
2 evidence entries

AI: OpenAI claims its AI model autonomously disproved the 80-year-old Erdős unit distance conjecture
Under review
Attribution: 90% | Veracity: 55%
1 evidence entry

AI: Microsoft's AI chief predicted human-level performance on professional computer tasks within 12–18 months
Speculative
Attribution: 82% | Veracity: 40%
2 evidence entries

Technology: Reports indicate Meta tracked internal employee activity through its 'Model Capability Initiative' to train AI models
Disputed
Attribution: 68% | Veracity: 52%
2 evidence entries

AI: Anthropic's valuation could surpass $900 billion, making it the world's most valuable private startup
Partially supported — round reportedly still being finalized
Attribution: 78% | Veracity: 55%
2 evidence entries

AI: TELUS Digital reports that every AI model tested was exploitable in a benchmark of 620,000+ adversarial attacks across 34 models
Evidence suggests supported
Attribution: 90% | Veracity: 65%
1 evidence entry

AI: Major Chinese AI platforms collectively restrict exam-assistance features ahead of 2026 gaokao
Evidence suggests supported
Attribution: 85% | Veracity: 82%
1 evidence entry

AI: Anthropic announced ten Claude agent templates for financial-services work
Evidence suggests announced
Attribution: 95% | Veracity: 68%
2 evidence entries

AI: Meta announced Muse Spark and said it powers Meta AI experiences
Evidence suggests announced
Attribution: 95% | Veracity: 64%
2 evidence entries