Topic

LLM Benchmark & Performance Claims

Every major AI release comes with bold benchmark claims — "state of the art," "surpasses GPT-4," "human-level reasoning." But how reliable are these numbers? This page tracks 17 community-assessed claims about model performance, benchmark scores, and capability evaluations, each backed by traceable source links.

From MMLU and HumanEval to custom coding and math benchmarks, every claim is scored on attribution accuracy (did the company really claim these numbers?) and veracity (do independent evaluations support the claim?). Inspect the evidence chain and decide for yourself.

17benchmark claims tracked

27source-backed evidence entries

2independent scoring dimensions

Google announced Gemini 1.5 with a long-context capability in 2024

Evidence suggests supported

Attribution: 94%Veracity: 81%

1 evidence entries →

OpenAI released GPT-5.4 across ChatGPT, the API, and Codex

Evidence suggests supported

Attribution: 96%Veracity: 82%

1 evidence entries →

Current deepfake detection tools fail to identify AI-generated content more than 30% of the time

Evidence inconclusive

Attribution: 60%Veracity: 55%

1 evidence entries →

technology

AI coding assistants now generate approximately 40% of new code at some major tech companies

Evidence suggests supported

Attribution: 85%Veracity: 75%

1 evidence entries →

Open-source AI models have closed the capability gap with proprietary models to within 5%

Evidence mixed

Attribution: 90%Veracity: 50%

2 evidence entries →

technology

Google claimed its Willow quantum chip solved a computation in under 5 minutes that would take classical supercomputers 10 septillion years

Partially supported

Attribution: 95%Veracity: 65%

2 evidence entries →

technology

SpaceX achieved the first successful controlled reentry and splashdown of Starship's upper stage in June 2024

Strongly supported

Attribution: 95%Veracity: 85%

2 evidence entries →

Anthropic published the first Responsible Scaling Policy committing to capability evaluations before training more powerful AI

Partially supported

Attribution: 95%Veracity: 65%

2 evidence entries →

Major AI models from OpenAI, Anthropic, and Google are becoming less capable while prices increase

Evidence inconclusive

Attribution: 68%Veracity: 45%

2 evidence entries →

OpenAI claims its AI model autonomously disproved the 80-year-old Erdős unit distance conjecture

Under review

Attribution: 90%Veracity: 55%

1 evidence entries →

Microsoft's AI chief predicted human-level performance on professional computer tasks within 12–18 months

Speculative

Attribution: 82%Veracity: 40%

2 evidence entries →

technology

Reports indicate Meta tracked internal employee activity through its 'Model Capability Initiative' to train AI models

Disputed

Attribution: 68%Veracity: 52%

2 evidence entries →

Anthropic's valuation could surpass $900 billion, making it the world's most valuable private startup

Partially supported — round reportedly still being finalized

Attribution: 78%Veracity: 55%

2 evidence entries →

TELUS Digital reports every AI model tested was exploitable in 620,000+ adversarial attack benchmark across 34 models

Evidence suggests supported

Attribution: 90%Veracity: 65%

1 evidence entries →

Major Chinese AI platforms collectively restrict exam-assistance features ahead of 2026 gaokao

Evidence suggests supported

Attribution: 85%Veracity: 82%

1 evidence entries →

Anthropic announced ten Claude agent templates for financial-services work

Evidence suggests announced

Attribution: 95%Veracity: 68%

2 evidence entries →

Meta announced Muse Spark and said it powers Meta AI experiences

Evidence suggests announced

Attribution: 95%Veracity: 64%

2 evidence entries →

Source-backed evidence

Preparing source-backed claims

LLM Benchmark & Performance Claims

Google announced Gemini 1.5 with a long-context capability in 2024

OpenAI released GPT-5.4 across ChatGPT, the API, and Codex

Current deepfake detection tools fail to identify AI-generated content more than 30% of the time

AI coding assistants now generate approximately 40% of new code at some major tech companies

Open-source AI models have closed the capability gap with proprietary models to within 5%

Google claimed its Willow quantum chip solved a computation in under 5 minutes that would take classical supercomputers 10 septillion years

SpaceX achieved the first successful controlled reentry and splashdown of Starship's upper stage in June 2024

Anthropic published the first Responsible Scaling Policy committing to capability evaluations before training more powerful AI

Major AI models from OpenAI, Anthropic, and Google are becoming less capable while prices increase

OpenAI claims its AI model autonomously disproved the 80-year-old Erdős unit distance conjecture

Microsoft's AI chief predicted human-level performance on professional computer tasks within 12–18 months

Reports indicate Meta tracked internal employee activity through its 'Model Capability Initiative' to train AI models

Anthropic's valuation could surpass $900 billion, making it the world's most valuable private startup

TELUS Digital reports every AI model tested was exploitable in 620,000+ adversarial attack benchmark across 34 models

Major Chinese AI platforms collectively restrict exam-assistance features ahead of 2026 gaokao

Anthropic announced ten Claude agent templates for financial-services work

Meta announced Muse Spark and said it powers Meta AI experiences

Spotted a dubious benchmark claim?

Preparing source-backed claims

LLM Benchmark & Performance Claims

Google announced Gemini 1.5 with a long-context capability in 2024

OpenAI released GPT-5.4 across ChatGPT, the API, and Codex

Current deepfake detection tools fail to identify AI-generated content more than 30% of the time

AI coding assistants now generate approximately 40% of new code at some major tech companies

Open-source AI models have closed the capability gap with proprietary models to within 5%

Google claimed its Willow quantum chip solved a computation in under 5 minutes that would take classical supercomputers 10 septillion years

SpaceX achieved the first successful controlled reentry and splashdown of Starship's upper stage in June 2024

Anthropic published the first Responsible Scaling Policy committing to capability evaluations before training more powerful AI

Major AI models from OpenAI, Anthropic, and Google are becoming less capable while prices increase

OpenAI claims its AI model autonomously disproved the 80-year-old Erdős unit distance conjecture

Microsoft's AI chief predicted human-level performance on professional computer tasks within 12–18 months

Reports indicate Meta tracked internal employee activity through its 'Model Capability Initiative' to train AI models

Anthropic's valuation could surpass $900 billion, making it the world's most valuable private startup

TELUS Digital reports every AI model tested was exploitable in 620,000+ adversarial attack benchmark across 34 models

Major Chinese AI platforms collectively restrict exam-assistance features ahead of 2026 gaokao

Anthropic announced ten Claude agent templates for financial-services work

Meta announced Muse Spark and said it powers Meta AI experiences

Spotted a dubious benchmark claim?