
AI IRL 25: Evaluating Language Models on Life's Curveballs

There are hundreds of benchmarks that use trick questions and puzzles to rank LLMs by raw intelligence. But in real life, what matters is how well AI can help handle common communication challenges: life's most awkward, delicate, or downright bizarre situations. We put the world's top language models through 25 real-world scenarios that would make even the smoothest talker sweat. From crafting the perfect excuse for a missed deadline to penning a heartfelt-yet-firm breakup message, we've got the scoop on which AI truly speaks human.


The Showdown

We pitted Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Pro, and Mistral Large against each other in a battle of words. Our secret weapon? We hired hundreds of expert writers and journalists with decades of experience to judge their responses.

How We Did It

The Prompts

We cherry-picked scenarios where even humans struggle to find the right words and asked four of the top AI models to produce their best response. A few examples:

  • Explaining to your boss that a coworker isn't pulling their weight

  • Composing a breakup text that's sincere but not overly dramatic

  • Inquiring about renting a sloth for a "slowest man wins" charity race (yes, really)

The Evaluation

Our expert panel blind-judged pairs of responses against each other, rating them on quality and identifying personality traits. We then used Elo ratings (like in chess rankings) to measure how the models did against each other and determine the ultimate AI communicator.
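To make the Elo approach concrete, here is a minimal sketch of how ratings update from pairwise preference judgments. The model names, starting rating, and K-factor below are illustrative assumptions, not the actual parameters used in our evaluation.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return updated (r_a, r_b) after one judged comparison."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    # Winner gains rating in proportion to how surprising the win was;
    # the loser loses the same amount, so total rating is conserved.
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Illustrative: start both models at 1000 and apply one judgment.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = update(
    ratings["model_a"], ratings["model_b"], a_won=True
)
```

After many blind pairwise judgments, the ratings converge toward a ranking in which a rating gap predicts the win probability between any two models.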


Aligned's RLHF (Side by Side Ratings) Eval


The Results Are In!

What the Experts Loved

Across the board, our judges favored responses that were clear, appropriately detailed, and most importantly, struck the right tone.

  • Wins: Each of the models impressed with its professionalism, warmth, and pragmatism, often matching or exceeding what our judges expected from a skilled human communicator.

  • Room for Improvement: Humor and imagination. Let's just say AI won't be headlining comedy clubs anytime soon. With additional fine-tuning, though, we think these models could become much more effective in these areas.

Personality Contest

Each AI showed its own communication style:

  • Claude 3.5 Sonnet: The people-pleaser. Warm, direct, and adaptable.

  • GPT-4o: The balanced communicator. Clear and warm, but occasionally robotic.

  • Gemini 1.5 Pro: The efficient friend. Concise yet friendly.

  • Mistral Large: The detail-oriented ally. Thorough, sometimes to a fault.

And the Winner Is…


Claude 3.5 Sonnet performed the best in our IRL 25 Eval

Claude 3.5 Sonnet took the crown! It consistently nailed the right tone and level of formality across diverse scenarios. While other models sometimes came off as too stiff or casual, Claude found the sweet spot of sounding genuinely human.

Want to See More?

  • AI Enthusiasts: Follow us for more fascinating AI insights and comparisons.

  • Businesses: Curious how you can evaluate your own models against the state of the art? Let's chat about how Aligned can set up custom human evaluations for your products and give you insight into how to make them better.

  • Expert Writers: Join our network and help shape the future of AI communication!

Ready to see which AI speaks your language? Dive into our interactive results or book a demo to see how Aligned can help evaluate your models!


