FRONTIER LLM COMPARISON · MAY 1, 2026

Cross-Model Performance: Claude · GPT · Gemini · Kimi · GLM · Qwen · DeepSeek · Grok · Mythos

The highest score in each row is highlighted. Empty cells indicate that no data has been published.
WIN COUNT, excluding Mythos (not yet released, so it is left out for a fair comparison)
Gemini 3.1 Pro: 9
Opus 4.7: 8
Opus 4.6: 6
GPT-5.5: 5
GPT-5.4: 5
DeepSeek V4: 3
Gemini 3 Pro: 1
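The win count above is a tally of how many benchmark rows each model leads once excluded models are set aside. A minimal sketch of that tally follows, under stated assumptions: the function name `count_wins`, the placeholder model and benchmark names, and the scores are all hypothetical and not taken from the table; a missing or `None` entry stands for an empty cell; and tied leaders each receive a win, which is an assumption since the source does not say how ties are handled.

```python
from collections import Counter


def count_wins(scores, excluded=frozenset()):
    """Tally how many benchmark rows each model leads.

    scores maps benchmark name -> {model name: score or None}.
    A missing model or a None score is treated as an empty cell.
    Assumption: every model tied for the top score gets one win.
    """
    wins = Counter()
    for row in scores.values():
        # Keep only published scores from models that are not excluded.
        eligible = {
            model: score
            for model, score in row.items()
            if model not in excluded and score is not None
        }
        if not eligible:
            continue
        best = max(eligible.values())
        for model, score in eligible.items():
            if score == best:
                wins[model] += 1
    return wins


# Hypothetical placeholder data, not values from the table.
example = {
    "bench_1": {"model_a": 90.0, "model_b": 85.0, "model_c": 80.0},
    "bench_2": {"model_a": 70.0, "model_b": 75.0},
    "bench_3": {"model_a": 60.0, "model_b": None, "model_c": 65.0},
}

wins = count_wins(example, excluded={"model_a"})
print(wins.most_common())
```

Passing the unreleased model in `excluded` mirrors the "excluding Mythos" rule: its scores are ignored, so the next-best published score in each row takes the win.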
ANTHROPIC OPENAI GOOGLE MOONSHOT Z.AI ALIBABA DEEPSEEK xAI
Mythos Preview (Apr 26 preview)
Opus 4.7 (16 Apr 26)
Opus 4.6 (5 Feb 26)
Opus 4.5 (24 Nov 25)
Sonnet 4.6 (17 Feb 26)
GPT 5.5 (23 Apr 26)
GPT 5.4 (5 Mar 26)
Gemini 3.1 Pro (19 Feb 26)
Gemini 3 Pro (18 Nov 25)
Kimi K2.6 (13 Apr 26)
GLM 5.1 (27 Mar 26)
Qwen 3.6 Plus (2 Apr 26)
DeepSeek V4-Pro (24 Apr 26)
DeepSeek V3.2 (1 Dec 25)
DeepSeek V3 (26 Dec 24)
Grok 4.20 (17 Feb 26)
Grok 4 (Jul 25)
SOFTWARE DEVELOPMENT / CODING
SWE-bench Verified 93.9% 87.6% 80.8% 80.9% 79.6% 77.2% 80.6% 76.2% 78.8% 80.6% ~74% 42.0% 72.0%
SWE-bench Pro 77.8% 64.3% 53.4% 58.6% 57.7% 54.2% 43.3% 56.6% 55.4%
SWE-bench Multilingual 87.3% 77.5% 73.8% 76.2%
SWE-bench Multimodal 59.0% 27.1%
Terminal-Bench 2.0 82.0% 69.4% 65.4% 59.3% 59.1% 82.7% 75.1% 68.5% 56.9% 61.6% 67.9% 46.4%
LiveCodeBench 88.8% 91.7% 87.1% 93.5% 49.2% 40.5% 79.0%
Codeforces (Elo) 3168 3052 3206
GENERAL KNOWLEDGE & REASONING
MMLU-Pro 89.1% 87.5% 91.0% 88.5% 87.5% 81.2% 75.9%
GPQA Diamond 94.6% 94.2% 91.3% 87.0% 89.9% 93.6% 94.4% 94.3% 91.9% 90.4% 90.1% 82.4% 59.1% 87.5%
Humanity's Last Exam 56.8% 46.9% 40.0% 33.2% 41.4% 42.7% 44.4% 37.5% 28.8% 37.7% 25.0%
SimpleQA-Verified 46.2% 45.3% 75.6% 57.9% 24.9%
Chinese-SimpleQA 76.2% 76.8% 85.9% 84.4% 64.8%
ARC-AGI-1 92.0% 94.0% 80.0% 86.5% 95.0% 93.7% 98.0% 75.0% 57.0% 89.5% 66.6%
ARC-AGI-2 75.8% 68.8% 37.6% 58.3% 85.0% 74.0% 77.1% 31.1% 4.0% 65.1% 15.9%
ARC-AGI-3 0.5% 0.2% 0.4% 0.1%
MMMLU (Multilingual) 91.5% 91.1% 90.8% 89.3% 92.6% 91.8% 89.5%
MMMU-Pro (Multimodal) 73.9% 74.5% 81.2% 81.2% 80.5% 81.0%
SciCode 52.0% 47.0% 59.0% 56.0%
CharXiv Reasoning (no tools) 86.1% 82.1% 69.1%
CharXiv Reasoning (with tools) 93.2% 91.0% 84.7%
MATHEMATICS
AIME 2025 99.8% 98.1% 95.0% 96.0% 91.7%
AIME 2026 95.8% 96.7% 95.1% 97.5% 99.2% 98.3% 91.7% 95.8% 95.8% 95.3% 95.8% 94.2%
USAMO 2026 47.0% 95.2% 74.4%
HMMT 2026 Feb 96.2% 97.7% 94.7% 87.8% 95.2%
IMOAnswerBench 75.3% 91.4% 81.0% 83.8% 89.8%
Apex 34.5% 54.1% 60.9% 18.4% 38.3%
Apex Shortlist 85.9% 78.1% 89.1% 90.2%
LONG CONTEXT (1M Tokens)
MRCR 1M 92.9% 76.3% 83.5%
CorpusQA 1M 71.7% 53.8% 62.0%
AGENTIC / COMPUTER USE
BrowseComp 86.9% 79.3% 83.7% 74.7% 84.4% 82.7% 85.9% 59.2% 83.4%
OSWorld-Verified 79.6% 78.0% 72.7% 66.3% 72.5% 78.7% 75.0%
HLE (with tools) 64.7% 54.7% 53.3% 49.0% 52.2% 58.7% 51.4% 50.6% 48.2% 44.4%
GDPval-AA (Elo) 1619 1633 1674 1317 1195 1554
MCP-Atlas 77.3% 75.8% 62.3% 61.3% 75.3% 68.1% 69.2% 54.1% 74.1% 73.6%
Toolathlon 47.2% 55.6% 54.6% 48.8% 51.8%
Finance Agent v1.1 64.4% 60.1% 60.0% 61.5% 59.7%
τ2-bench (Retail) 91.9% 88.9% 91.7% 90.8% 85.3%
τ2-bench (Telecom) 99.3% 98.2% 97.9% 98.0% 98.9% 99.3% 98.0%
Vending-Bench 2 ($) $10,937 $8,018 $4,967 $7,204 $7,524 $6,144 $911 $5,478 $6,205 $5,634 $5,115 $3,285 $1,034 $4,663
CYBERSECURITY
CyberGym 83.1% 73.1% 73.8% 81.8% 66.3%