FRONTIER LLM COMPARISON · 9 JUNE 2026

Cross-Model Performance: Claude · GPT · Gemini · Kimi · GLM · Qwen · DeepSeek · MiniMax · Grok · Fable

Each row is colored on its own scale: green = high, yellow = medium, red = low. The top score in a row is the greenest cell and has a white border. Empty cells mean no result was published.
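The per-row coloring described above can be sketched roughly as follows. This is a minimal illustration, not the page's actual rendering code: the function names `color_row` and `leader_index`, the min-max normalization, and the cut-offs at one third and two thirds are all assumptions. The sample values come from the SWE-bench Multimodal row, and the position of the empty cell is hypothetical.

```python
from typing import Optional


def color_row(scores: list[Optional[float]]) -> list[Optional[str]]:
    """Bucket every score relative to its own row; None is an unpublished cell."""
    present = [s for s in scores if s is not None]
    if not present:
        return [None] * len(scores)
    lo, hi = min(present), max(present)
    span = hi - lo
    labels: list[Optional[str]] = []
    for s in scores:
        if s is None:
            labels.append(None)  # empty cell stays uncolored
            continue
        # Min-max position in the row; a row of identical scores counts as all-top.
        t = 1.0 if span == 0 else (s - lo) / span
        # Assumed cut-offs: thirds of the row's range.
        if t >= 2 / 3:
            labels.append("green")
        elif t >= 1 / 3:
            labels.append("yellow")
        else:
            labels.append("red")
    return labels


def leader_index(scores: list[Optional[float]]) -> Optional[int]:
    """Index of the first cell holding the row maximum (the white-bordered one)."""
    present = [(s, i) for i, s in enumerate(scores) if s is not None]
    if not present:
        return None
    best = max(s for s, _ in present)
    return next(i for s, i in present if s == best)


row = [59.0, None, 38.4, 34.5, 27.1]  # None position is hypothetical
print(color_row(row))     # ['green', None, 'yellow', 'red', 'red']
print(leader_index(row))  # 0
```

Because the scale is computed per row, a "green" cell in one benchmark is not comparable to a "green" cell in another; it only says the score sits near the top of that row's range.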
WIN COUNT: number of benchmarks on which each model has the highest score (Mythos Preview is the reference and is excluded)
Fable 5: 29
Gemini 3.1 Pro: 8
Opus 4.8: 7
GPT-5.4: 4
Opus 4.6: 4
Gemini 3.5 Flash: 4
GPT-5.5: 3
DeepSeek V4: 3
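A win count of this kind could be computed along the following lines. This is a sketch under stated assumptions: the function name `count_wins` and the nested-dict layout are hypothetical, every benchmark is treated as higher-is-better, and tied leaders each receive a win (the source does not say how ties are handled). The toy scores are made up for illustration and are not taken from the table.

```python
from collections import Counter
from typing import Optional


def count_wins(
    table: dict[str, dict[str, Optional[float]]],
    excluded: tuple[str, ...] = ("Mythos Preview",),
) -> Counter:
    """Count, per model, the benchmarks on which it holds the top score."""
    wins: Counter = Counter()
    for scores in table.values():
        # Drop the reference model and any unpublished (None) cells.
        eligible = {
            model: score
            for model, score in scores.items()
            if model not in excluded and score is not None
        }
        if not eligible:
            continue
        best = max(eligible.values())
        # Assumption: every model tied at the top is credited with a win.
        for model, score in eligible.items():
            if score == best:
                wins[model] += 1
    return wins


# Made-up illustrative scores, not values from the table.
toy = {
    "bench_1": {"Model A": 71.0, "Model B": 64.5, "Mythos Preview": 80.0},
    "bench_2": {"Model A": 55.0, "Model B": 58.0},
    "bench_3": {"Model A": 90.0, "Model B": None, "Mythos Preview": 88.0},
}
print(count_wins(toy).most_common())  # [('Model A', 2), ('Model B', 1)]
```

In `bench_1` the reference model has the highest raw score, but it is excluded, so the win goes to the best of the remaining models.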
Vendors (column groups): Anthropic · OpenAI · Google · Moonshot · Z.ai · Alibaba · DeepSeek · MiniMax · xAI
Columns, left to right (release month):
Fable 5 (Jun 26)
Mythos Preview (Apr 26)
Opus 4.8 (May 26)
Opus 4.7 (Apr 26)
Opus 4.6 (Feb 26)
Opus 4.5 (Nov 25)
Sonnet 4.6 (Feb 26)
GPT 5.5 (Apr 26)
GPT 5.4 (Mar 26)
Gemini 3.5 Flash (May 26)
Gemini 3.1 Pro (Feb 26)
Gemini 3 Pro (Nov 25)
Kimi K2.6 (Apr 26)
GLM 5.2 (Jun 26)
GLM 5.1 (Mar 26)
Qwen 3.7 Max (May 26)
Qwen 3.6 Plus (Apr 26)
DeepSeek V4-Pro (Apr 26)
DeepSeek V3.2 (Dec 25)
DeepSeek V3 (Dec 24)
MiniMax M3 (Jun 26)
Grok 4.3 (Apr 26)
Grok 4.20 (Feb 26)
Grok 4 (Jul 25)
SOFTWARE ENGINEERING / CODING
SWE-bench Verified 95.5% 93.9% 88.6% 87.6% 80.8% 80.9% 79.6% 77.2% 80.6% 76.2% 78.8% 80.6% 42.0% 80.5% 72.0%
SWE-bench Pro 80.3% 77.8% 69.2% 64.3% 53.4% 58.6% 57.7% 55.1% 54.2% 43.3% 62.1% 56.6% 55.4% 59.0%
SWE-bench Multilingual 87.3% 84.4% 80.5% 77.5% 73.8% 76.2%
SWE-bench Multimodal 59.0% 38.4% 34.5% 27.1%
Terminal-Bench 2.0 82.0% 69.4% 65.4% 59.3% 59.1% 82.7% 75.1% 68.5% 56.9% 61.6% 67.9% 46.4%
Terminal-Bench 2.1 88.0% 82.7% 66.1% 83.4% 76.2% 70.7% 66.0%
Terminal-Bench Hard 62.9% 58.3% 51.5% 46.2% 47.0% 53.0% 60.6% 57.6% 40.9% 53.8% 41.7% 43.9% 50.8% 43.2% 50.8% 43.9% 46.2% 35.6% 6.8% 42.4% 37.9% 37.9% 37.9%
FrontierCode (Diamond) 29.3% 13.4% 5.7%
LiveCodeBench 88.8% 91.7% 87.1% 93.5% 49.2% 40.5% 79.0%
Codeforces (Elo) 3168 3052 3206
GENERAL KNOWLEDGE & REASONING
▸ Science
GPQA Diamond 92.6% 92.0% 91.4% 89.6% 86.6% 87.5% 93.5% 92.0% 92.2% 94.1% 90.8% 91.1% 89.5% 86.8% 92.3% 88.2% 88.8% 84.0% 55.7% 92.9% 90.1% 91.1% 87.7%
CritPt 28.6% 20.9% 12.0% 12.6% 4.6% 3.1% 27.1% 23.4% 13.1% 17.7% 9.1% 8.0% 20.9% 4.6% 13.4% 2.9% 12.9% 2.9% 0.0% 3.7% 8.0% 6.6% 2.0%
SciCode 60.2% 53.5% 54.5% 51.9% 49.5% 46.8% 56.1% 56.6% 53.1% 58.9% 56.1% 53.5% 50.5% 43.8% 48.8% 40.7% 50.0% 38.9% 35.4% 45.4% 47.3% 45.6% 45.7%
Humanity's Last Exam 53.3% 45.7% 39.6% 36.7% 28.4% 30.0% 44.3% 41.6% 41.0% 44.7% 37.2% 35.9% 40.1% 28.0% 38.1% 25.7% 35.9% 22.2% 3.6% 37.1% 35.0% 32.2% 23.9%
Blueprint-Bench 2 38.6% 14.5% 24.5% 6.7% 36.2% 33.6% 26.5%
▸ Health & Biomedical
HealthBench 62.7% 61.1% 59.3% 56.5%
HealthBench Professional 66.0% 64.7% 56.9% 51.8%
BioMysteryBench (human) 83.9% 82.6% 80.4%
BioMysteryBench (hard) 46.1% 29.6% 40.0%
▸ Knowledge & Factuality
MMLU-Pro 89.1% 87.5% 91.0% 88.5% 87.5% 81.2% 75.9%
MMMLU (Multilingual) 92.7% 91.5% 91.1% 90.8% 89.3% 92.6% 91.8% 89.5%
SimpleQA-Verified 46.2% 45.3% 75.6% 57.9% 24.9%
Chinese-SimpleQA 76.2% 76.8% 85.9% 84.4% 64.8%
IFBench 63.5% 62.2% 58.6% 53.1% 58.0% 56.6% 75.9% 73.9% 76.3% 77.1% 70.4% 76.0% 73.3% 76.3% 80.5% 75.2% 76.5% 60.7% 34.8% 82.9% 81.3% 81.2% 53.7%
▸ Abstract Reasoning (ARC-AGI)
ARC-AGI-1 92.0% 92.0% 93.0% 80.0% 86.0% 95.0% 93.7% 92.5% 98.0% 75.0% 57.0% 89.5% 66.6%
ARC-AGI-2 72.1% 75.8% 68.8% 37.6% 58.3% 85.0% 74.0% 72.1% 77.1% 31.1% 4.0% 65.1% 15.9%
ARC-AGI-3 1.5% 0.2% 0.5% 0.4% 0.2% 0.4% 0.1%
▸ Multimodal & Vision
MMMU-Pro (Multimodal) 73.9% 74.5% 81.2% 81.2% 83.6% 80.5% 81.0% 78.1%
CharXiv Reasoning (no tools) 88.9% 86.1% 82.1% 69.1% 84.2%
CharXiv Reasoning (with tools) 93.5% 93.2% 91.0% 84.7%
ChartQAPro (no tools) 69.4% 67.6%
ChartQAPro (with tools) 72.3% 69.8%
MATHEMATICS
AIME 2025 99.8% 98.1% 95.0% 96.0% 91.7%
AIME 2026 95.8% 96.7% 95.1% 97.5% 99.2% 98.3% 91.7% 95.8% 99.2% 95.8% 95.3% 95.8% 94.2%
USAMO 2026 97.6% 66.2% 95.2% 74.4%
HMMT 2026 Feb 96.2% 97.7% 94.7% 92.5% 87.8% 95.2%
IMOAnswerBench 75.3% 91.4% 81.0% 91.0% 83.8% 89.8%
Apex 34.5% 54.1% 60.9% 18.4% 38.3%
Apex Shortlist 85.9% 78.1% 89.1% 90.2%
ArxivMath 78.5% 68.7% 71.8% 71.5% 64.8%
RiemannBench 55.0% 43.0% 34.0%
LONG CONTEXT (1M Tokens)
MRCR 1M 92.9% 26.6% 76.3% 83.5%
CorpusQA 1M 71.7% 53.8% 62.0%
GraphWalks BFS 256K 91.1% 85.7% 85.9% 76.9% 38.7% 73.7% 21.4%
GraphWalks Parents 256K 99.96% 99.9% 99.3% 93.6% 90.1%
AGENTIC / COMPUTER USE
▸ Browser & Computer Use
BrowseComp 88.0% 87.9% 84.3% 79.8% 83.7% 74.7% 84.4% 82.7% 85.9% 59.2% 83.4% 83.5%
OSWorld-Verified 85.0% 85.4% 83.4% 82.8% 72.7% 66.3% 72.5% 78.7% 75.0% 78.4% 76.2% 70.1%
ScreenSpot-Pro (no tools) 82.3% 79.5%
ScreenSpot-Pro (with tools) 87.9% 87.6%
Automation Bench 17.4% 15.5% 9.9% 12.9% 14.5% 9.6%
▸ Tool & Protocol Use
MCP-Atlas 82.2% 79.1% 75.8% 62.3% 61.3% 75.3% 68.1% 83.6% 78.2% 54.1% 76.8% 74.1% 73.6% 74.2%
Toolathlon 47.2% 55.6% 54.6% 56.5% 48.8% 51.8%
τ2-bench (Retail) 91.9% 88.9% 91.7% 90.8% 85.3%
τ2-bench (Telecom) 99.3% 98.2% 97.9% 98.0% 98.9% 99.3% 98.0%
▸ Domain Agents (Finance/Legal/Office)
Finance Agent v1.1 64.4% 60.1% 60.0% 61.5% 59.7%
Finance Agent v2 53.9% 51.5% 51.8% 57.9% 43.0%
Legal Agent (Harvey set) 13.3% 10.4% 2.1% 0.8% 0.0%
Legal Agent (open set) 16.9% 13.4% 9.6%
OfficeQA Pro 57.9% 48.1% 52.6% 18.1%
Vending-Bench 2 ($) $5,680 $5,787 $10,937 $8,018 $4,967 $7,204 $7,524 $6,144 $5,396 $911 $5,478 $6,205 $8,314 $5,634 $5,115 $3,285 $1,034 $4,663
▸ Document & Economic Tasks
HLE (with tools) 64.5% 64.7% 57.9% 54.7% 53.3% 49.0% 52.2% 58.7% 51.4% 50.6% 48.2% 44.4%
GDPval-AA (Elo) 1932 1890 1753 1619 1450 1676 1769 1674 1656 1314 1184 1481 1524 1535 1300 1352 1554 1197 409 1670 1098 1168 990
GDP.pdf (visual) 29.8% 22.5% 24.9% 16.7%
CYBERSECURITY
CyberGym 83.1% 73.1% 73.8% 81.8% 66.3%
ExploitBench 78.0% 69.0% 40.0% 34.0%