LLM Evaluation in 2026: Why Your Benchmark Scores Don't Matter Anymore

The Uncomfortable Truth About LLM Benchmarks

Here’s a fact that might surprise you: GPT-4, Claude 3.5, and Gemini all score nearly identically on traditional fluency metrics. Yet anyone who’s used these models in production knows they behave very differently. Your chatbot hallucinates less with Claude. Your summarization pipeline produces more accurate outputs with GPT-4. Your document extraction works better with a fine-tuned smaller model than any of the giants.

So what’s going on? Why are we still chasing BLEU scores and perplexity when they clearly don’t predict real-world performance? ...
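To make the BLEU complaint concrete, here is a minimal sketch (not from the original post) of how an n-gram overlap metric rewards surface fluency over factual accuracy. It uses NLTK's `sentence_bleu`, which is a real API; the reference and candidate sentences are hypothetical examples chosen for illustration.

```python
# A minimal sketch, assuming NLTK is installed, of how BLEU scores
# a fluent but factually wrong output. Sentences are hypothetical.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "The meeting was moved to Tuesday at 3 pm".split()

accurate = "The meeting was moved to Tuesday at 3 pm".split()   # factually correct
wrong    = "The meeting was moved to Thursday at 3 pm".split()  # wrong day

smooth = SmoothingFunction().method1
print(sentence_bleu([reference], accurate, smoothing_function=smooth))  # 1.0
print(sentence_bleu([reference], wrong, smoothing_function=smooth))     # ~0.6: one swapped
                                                                        # token, but the key
                                                                        # fact is now wrong
```

One swapped token costs the wrong candidate a fraction of its n-gram overlap, so it still earns a respectable score, yet in a scheduling assistant that single token is the difference between a useful answer and a harmful one. That gap between metric distance and usefulness distance is exactly the mismatch this post is about.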

February 15, 2026 · 20 min · Akshay Uppal