Claude 3.5 Sonnet Overtakes GPT-4o

Key Points:

  • Sets new industry benchmarks for graduate-level reasoning (GPQA), undergraduate-level knowledge (MMLU), and coding proficiency (HumanEval)
  • Introduces the new Artifacts feature for seamless, real-time AI-powered collaboration on Claude.ai
  • Operates at twice the speed of Claude 3 Opus
  • Shows a more nuanced understanding of humour and complex instructions
  • Cost-effective pricing at $3 per million input tokens and $15 per million output tokens, with a 200K token context window

Anthropic has just dropped a pleasant surprise in the AI world with the release of Claude 3.5 Sonnet, the first model in their upcoming Claude 3.5 family. This isn’t just another update – it’s another leap forward that’s elevating industry standards and raising the stakes in the AI race. Now OpenAI’s flagship GPT-4o model has some proper competition.

Speed and Affordability: The Perfect Package

Claude 3.5 Sonnet is redefining what we thought was possible in AI performance. It outshines leaders like GPT-4o, Gemini 1.5 Pro, and even Anthropic’s own Claude 3 Opus across a wide range of evaluations. But it’s not just about raw intelligence: Claude 3.5 Sonnet has taken major strides in understanding nuance, humour, and complex instructions, and it is now more reliable at producing high-quality writing with a natural, relatable tone that many users will appreciate.

Anthropic achieved all this while maintaining speed and cost-effectiveness. The model is twice as fast as Claude 3 Opus, with pricing that won’t break the bank: $3 per million input tokens and $15 per million output tokens, plus a generous 200K token context window, noticeably larger than the 128K window of OpenAI’s GPT-4o (though Google’s Gemini 1.5 Pro still offers an even larger one).
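To make that pricing concrete, here is a minimal sketch using the Anthropic Python SDK to call Claude 3.5 Sonnet and estimate what a single request costs. The model ID and per-token rates match those announced at launch; the cost arithmetic is our own illustration, not an official SDK feature.

```python
import anthropic

# Launch pricing for Claude 3.5 Sonnet (USD per million tokens).
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    messages=[{"role": "user",
               "content": "Summarise the trade-offs of a 200K token context window."}],
)

# The API reports exact token usage, so the cost of a call is easy to estimate.
cost = (
    message.usage.input_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK
    + message.usage.output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MTOK
)
print(message.content[0].text)
print(f"Approximate cost: ${cost:.6f}")
```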

Despite the pricing and anecdotal styling improvements, its greatest strengths lie in its exceptional reasoning and knowledge capabilities. On the Graduate-Level Google-Proof Q&A (GPQA) test measuring graduate-level reasoning, Claude 3.5 Sonnet surpasses GPT-4o’s performance. It also effectively matches OpenAI’s model on the MMLU benchmark evaluating broad undergraduate-level knowledge, showing a deep grasp of complex topics and nuanced concepts. The model demonstrates sophisticated reasoning skills that could prove invaluable for tasks requiring high-level analysis and problem-solving. Coding is another area where Claude 3.5 Sonnet shines: it scores 92.0% on the HumanEval benchmark, up from Claude 3 Opus’s 84.9%, and in Anthropic’s internal agentic coding evaluation it solves an impressive 64% of problems, dwarfing the 38% solved by its predecessor.
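To make those coding numbers concrete: HumanEval scores a model by whether its generated function body passes a set of hidden unit tests (the pass@1 metric). Below is a toy sketch of that check; the palindrome problem and the test inside `passes_hidden_tests` are illustrative stand-ins, not part of the actual benchmark harness.

```python
# A HumanEval-style task: the model sees a signature and docstring,
# and its completion is accepted only if it passes hidden unit tests.
PROMPT = '''
def is_palindrome(s: str) -> bool:
    """Return True if s reads the same forwards and backwards."""
'''

# Pretend this completion came back from the model.
CANDIDATE_COMPLETION = "    return s == s[::-1]\n"

def passes_hidden_tests(completion: str) -> bool:
    """Execute the completed function, then run the hidden tests against it."""
    namespace: dict = {}
    exec(PROMPT + completion, namespace)  # real harnesses sandbox this step
    f = namespace["is_palindrome"]
    return f("level") and not f("claude")

# pass@1 over a suite is simply the fraction of problems whose first sample passes.
print(passes_hidden_tests(CANDIDATE_COMPLETION))  # True
```

Here is how the numbers stack up across the full benchmark suite: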

| Task | Claude 3.5 Sonnet | Claude 3 Opus | GPT-4o |
| --- | --- | --- | --- |
| Graduate-level reasoning (GPQA, Diamond) | 59.4% (0-shot CoT) | 50.4% (0-shot CoT) | 53.6% (0-shot CoT) |
| Undergraduate-level knowledge (MMLU) | 88.7% (5-shot); 88.3% (0-shot CoT) | 86.8% (5-shot); 85.7% (0-shot CoT) | 88.7% (0-shot CoT) |
| Code (HumanEval) | 92.0% | 84.9% (0-shot) | 90.2% |
| Multilingual math (MGSM) | 91.6% (0-shot CoT) | 90.7% (0-shot CoT) | 90.5% |
| Reasoning over text (DROP, F1 score) | 87.1 (3-shot) | 83.1 (3-shot) | 83.4 |
| Mixed evaluations (BIG-Bench-Hard) | 93.1% (3-shot CoT) | 86.8% (3-shot CoT) | n/a |
| Math problem-solving (MATH) | 71.1% (0-shot CoT) | 60.1% (0-shot CoT) | 76.6% (0-shot CoT) |
| Grade school math (GSM8K) | 96.4% (0-shot CoT) | 95.0% (0-shot CoT) | n/a |

Evaluation results for understanding and critical-thinking benchmarks
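A note on the shot counts in the table: “0-shot CoT” means the model is asked to reason step by step with no worked examples, while “5-shot” prepends five solved examples to the question. A rough sketch of the difference, with made-up example text:

```python
def zero_shot_cot_prompt(question: str) -> str:
    # No examples; just invite step-by-step (chain-of-thought) reasoning.
    return f"{question}\nLet's think step by step."

def few_shot_prompt(question: str, solved_examples: list[str]) -> str:
    # Prepend k worked examples (k = 5 for the "5-shot" rows above).
    return "\n\n".join(solved_examples) + "\n\n" + question

print(zero_shot_cot_prompt("What is 17 * 24?"))
```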

Visual Understanding

Another key domain is visual reasoning and extraction, which has recently been at the forefront of multi-modal LLM development. Claude 3.5 Sonnet’s improved performance extends here too, with superior results on most vision benchmarks. It demonstrates stronger skills in interpreting charts and graphs, and even in transcribing text from imperfect images. As multi-modal vision improves, this unlocks more actionable use cases for industries like retail, logistics, and finance that rely heavily on visual data analysis. An AI assistant that can quickly and accurately make sense of charts and documents is an important tool, and Claude is shaping up to be exactly that.

| Task | Claude 3.5 Sonnet | Claude 3 Opus | GPT-4o |
| --- | --- | --- | --- |
| Visual math reasoning (MathVista, testmini) | 67.7% (0-shot CoT) | 50.5% (0-shot CoT) | 63.8% (0-shot CoT) |
| Science diagrams (AI2D, test) | 94.7% (0-shot) | 88.1% (0-shot) | 94.2% (0-shot) |
| Visual question answering (MMMU, val) | 68.3% (0-shot CoT) | 59.4% (0-shot CoT) | 69.1% (0-shot CoT) |
| Chart Q&A (ChartQA, relaxed accuracy, test) | 90.8% (0-shot CoT) | 80.8% (0-shot CoT) | 85.7% (0-shot CoT) |
| Document visual Q&A (DocVQA, ANLS score, test) | 95.2% (0-shot) | 89.3% (0-shot) | 92.8% (0-shot) |

Visual understanding tests
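For teams who want to try that chart-reading ability themselves, the Anthropic messages API accepts images as base64-encoded content blocks alongside text. A minimal sketch follows; the local chart.png path is a stand-in for whatever chart or document image you want analysed.

```python
import base64
import anthropic

client = anthropic.Anthropic()

# Load a chart image and encode it for the API (chart.png is a stand-in path).
with open("chart.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": image_data}},
            {"type": "text",
             "text": "What trend does this chart show? Give me the key figures."},
        ],
    }],
)
print(message.content[0].text)
```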

Interactive Coding in your Browser

Anthropic is taking user experience to the next level with the introduction of Artifacts on Claude.ai. This feature creates a dedicated window alongside your conversation for generated content like code snippets, documents, or website designs. It’s a dynamic workspace that allows users to see, edit, and build upon Claude’s creations in real-time – a game-changer for collaborative AI-assisted work.

How Great is it Really? 

Claude 3.5 Sonnet undoubtedly showcases increased performance, decreased costs, and more intelligent behaviour overall. However, it’s important to keep in mind that while immensely capable, this model is an incremental advancement rather than a paradigm shift in how we utilise large language models. At its core, Claude 3.5 Sonnet is an enhanced and accelerated version of the previous flagship “Opus” model, poised to become the new standard for Claude users. It promises a more seamless, efficient experience across a wide range of applications and use cases. While we’re thoroughly impressed by the positive strides this update has made, we eagerly anticipate what further innovations Anthropic has in store as they continue pushing the boundaries of AI.

Would you love to take your AI use to the next level with Claude or OpenAI? Fill in our Strategy Quiz below 🔽

Kyriakos Hjikakou
