Table of Contents
- The Illusion of Accuracy in Structured Output
- Why Structured Hallucinations Are the New Silent Killer
- Benchmarking Across Modalities: Text, Image, and Audio
- The Open-Source Surprise: Why Smaller Models Are Winning
- The Challenge of Cross-Modal Consistency
- Toward Deterministic AI: The Path Forward
- The Future of Reliable AI Workflows
The Hidden Flaw in AI’s “Perfect” Answers: Why Structured Output Isn’t as Reliable as It Looks
Imagine you’re building an AI-powered system that processes thousands of invoices daily. The model returns clean JSON—correct schema, valid syntax, proper data types. Everything looks flawless. But weeks later, your finance team discovers that 37% of the invoice dates are off by months, quietly corrupting downstream accounting systems. The JSON was valid. The structure was perfect. But the values were hallucinated.
This isn’t a hypothetical. It’s a real, growing crisis in AI-driven automation. As large language models (LLMs) become the backbone of enterprise workflows—converting meeting transcripts into support tickets, extracting data from PDFs, or parsing audio for customer insights—the assumption that “valid JSON = correct data” is proving dangerously false. A new benchmark, the Structured Output Benchmark (SOB), is now shining a light on this silent epidemic of structured hallucinations—and the results are reshaping how we evaluate AI reliability.
The Illusion of Accuracy in Structured Output
For years, developers have relied on LLMs to generate structured data (JSON, XML, or database entries) because it is easier to integrate into software systems than raw text. But the standard tools for evaluating these outputs, such as JSONSchemaBench, only check whether the returned data matches the expected schema and types. They don't validate the truth of the values.
This creates a dangerous blind spot. A model might return `{"invoice_date": "2024-03-15"}` when the actual date in the source document was "2024-01-15". The JSON is valid. The type is correct (a string in ISO date format). But the value is wrong, and unless you have a ground-truth reference, you'll never know.
The SOB benchmark changes this by introducing value accuracy as a core metric. Every test case includes not just a JSON schema, but a human-verified ground-truth answer cross-checked against the original source—whether it’s a scanned invoice, a meeting recording, or a product manual. This allows the benchmark to detect not just schema violations, but subtle, plausible-sounding errors that slip through traditional validation.
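To make the blind spot concrete, here is a minimal sketch in Python using the `jsonschema` library. The schema and values are illustrative, echoing the invoice example above; they are not taken from SOB itself:

```python
from jsonschema import validate  # pip install jsonschema

# Schema for a single extracted field, mirroring the invoice example above.
schema = {
    "type": "object",
    "properties": {"invoice_date": {"type": "string", "format": "date"}},
    "required": ["invoice_date"],
}

extracted = {"invoice_date": "2024-03-15"}     # what the model returned
ground_truth = {"invoice_date": "2024-01-15"}  # what the source document says

# Schema-only evaluation: this passes without complaint.
validate(instance=extracted, schema=schema)

# Value accuracy is a separate check that schema validation never performs.
print(extracted["invoice_date"] == ground_truth["invoice_date"])  # False
```

The schema check and the truth check are independent, which is exactly why a pipeline built on schema validation alone can stay green while its data quietly rots.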
Why Structured Hallucinations Are the New Silent Killer
Hallucinations in LLMs are nothing new. But structured hallucinations are particularly insidious because they're harder to catch. Unlike a model claiming "the moon is made of cheese," a structured hallucination like `"targetmarketage": "25 to 35"` when the truth is `"15 to 35"` looks reasonable. It's type-correct, schema-compliant, and contextually plausible. It passes automated checks. But it's still wrong.
These errors compound in real-world systems. A marketing team using AI to analyze customer interviews might miss a key demographic segment. A logistics company automating shipment data extraction could misroute packages. The cost isn’t just in rework—it’s in lost trust, compliance risks, and flawed decision-making.
The SOB benchmark reveals that even top-tier models like GPT-5.4 and Claude-Sonnet-4.6 struggle with value accuracy, especially when moving beyond text. This isn’t just a model capability issue—it’s a fundamental challenge in aligning AI outputs with real-world truth.
Benchmarking Across Modalities: Text, Image, and Audio
One of the most groundbreaking aspects of the SOB is its multi-modal evaluation. Most benchmarks test LLMs on text alone, but real-world applications often involve images (scanned documents), audio (customer calls), or a mix. The SOB evaluates performance across all three, revealing surprising shifts in model rankings.
For example, GLM-4.7, an open-source model, ranks #2 overall—just behind GPT-5.4—but it leads in text-based structured output, outperforming even larger proprietary models. Meanwhile, Gemma-4-31B dominates in image processing, while Gemini-2.5-Flash takes the crown in audio. This modality-specific performance underscores a critical insight: no single model is best at everything.
- Gemma-4-31B leads image-based extraction with 88.7% accuracy.
- Gemini-2.5-Flash scores 91.5% on audio transcripts.
- GPT-5.4 ranks 3rd in text but drops to 9th in image tasks.
- Phi-4 (14B) outperforms GPT-5 and GPT-5-mini on text despite being smaller.
These results challenge the assumption that bigger models are always better. Qwen3.5-35B and GLM-4.7, both open-source, beat GPT-5 and Claude-Sonnet-4.6 on value accuracy—proving that architecture, training data, and fine-tuning matter more than sheer parameter count.
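As a sketch of how such a per-modality leaderboard can be computed, here is one way to aggregate field-level results in Python. The record format and the example numbers are assumptions for illustration, not SOB's actual harness:

```python
from collections import defaultdict

def per_modality_accuracy(results: list[dict]) -> dict:
    """Aggregate field-level value accuracy per (model, modality) pair.

    Each record is assumed to look like:
    {"model": "GLM-4.7", "modality": "text",
     "correct_fields": 18, "total_fields": 20}
    """
    totals = defaultdict(lambda: [0, 0])  # (model, modality) -> [correct, total]
    for r in results:
        key = (r["model"], r["modality"])
        totals[key][0] += r["correct_fields"]
        totals[key][1] += r["total_fields"]
    return {key: correct / total for key, (correct, total) in totals.items()}

# Illustrative records only; the scores below are not benchmark figures.
scores = per_modality_accuracy([
    {"model": "GLM-4.7", "modality": "text", "correct_fields": 18, "total_fields": 20},
    {"model": "GPT-5.4", "modality": "text", "correct_fields": 17, "total_fields": 20},
])
print(scores)  # {('GLM-4.7', 'text'): 0.9, ('GPT-5.4', 'text'): 0.85}
```

Scoring per modality rather than in aggregate is what lets a benchmark surface the ranking shifts described above.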
The Open-Source Surprise: Why Smaller Models Are Winning
One of the most unexpected findings from the SOB is the strong performance of open-source models. GLM-4.7, developed by Zhipu AI, doesn’t just compete with GPT-5.4—it often surpasses it in deterministic tasks. Similarly, Phi-4, a 14-billion-parameter model from Microsoft, outperforms both GPT-5 and GPT-5-mini on text-based structured output.
This shift reflects a broader trend: determinism favors precision over scale. Open-source models are often fine-tuned on high-quality, domain-specific datasets and optimized for reliability rather than creative fluency. They’re less likely to “improvise” values when the correct answer is clear.
This doesn’t mean proprietary models are obsolete. GPT-5.4 still leads in overall versatility and creative tasks. But for workflows requiring repeatable, accurate outputs, open-source models are emerging as the safer bet.
The Challenge of Cross-Modal Consistency
Another revelation from the SOB is the lack of cross-modal consistency in top models. A model that excels at parsing text might fail at extracting data from a scanned PDF or a noisy audio clip. This inconsistency is a major hurdle for real-world deployment, where data comes in multiple formats.
For instance, a customer support ticket might originate as a voice call (audio), be transcribed (text), and then attached as a PDF (image). A robust system needs a model—or a pipeline—that maintains accuracy across all stages. The SOB shows that few models achieve this today.
This underscores the need for modality-aware evaluation. The SOB’s approach—testing each model on text, image, and audio separately—provides a clearer picture of where strengths and weaknesses lie.
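One pragmatic response, sketched below, is to route each artifact to whichever model benchmarks best on its modality. The routing table echoes the rankings reported above; `extract_structured` is a hypothetical stand-in for your actual API calls, not a real library function:

```python
# Routing table echoing the per-modality leaders reported above.
BEST_MODEL_BY_MODALITY = {
    "text": "GLM-4.7",
    "image": "Gemma-4-31B",
    "audio": "Gemini-2.5-Flash",
}

def extract_structured(model: str, payload: bytes, schema: dict) -> dict:
    """Hypothetical wrapper around your actual model API call."""
    raise NotImplementedError

def route_extraction(artifact: dict, schema: dict) -> dict:
    """Dispatch an artifact ({"modality": ..., "data": ...}) to the
    model that benchmarks best on that input type."""
    model = BEST_MODEL_BY_MODALITY[artifact["modality"]]
    return extract_structured(model=model, payload=artifact["data"], schema=schema)
```

A pipeline like this trades the simplicity of a single model for the measured strengths of several, which is only possible once a benchmark scores modalities separately.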
Toward Deterministic AI: The Path Forward
The ultimate goal of the SOB isn’t just to rank models—it’s to push the entire field toward deterministic AI. In critical applications like finance, healthcare, and logistics, outputs must be not just structured, but correct. This requires a shift in how we train, evaluate, and deploy LLMs.
One promising direction is ground-truth anchoring, where models are trained not just to generate JSON, but to cite sources for each value. Another is field-level validation, where each extracted field is checked against a knowledge base or rule engine before being accepted.
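A minimal sketch of the field-level idea, assuming a hand-written rule table rather than a full rule engine (the field names and rules are illustrative):

```python
from datetime import date

# Illustrative rule table: one predicate per field, checked before a
# record is accepted into downstream systems.
RULES = {
    "invoice_date": lambda v: date.fromisoformat(v) <= date.today(),
    "total_amount": lambda v: isinstance(v, (int, float)) and v >= 0,
}

def validate_fields(record: dict) -> dict:
    """Return field -> bool; any False flags the record for review."""
    results = {}
    for field, check in RULES.items():
        try:
            results[field] = bool(check(record[field]))
        except (KeyError, ValueError, TypeError):
            results[field] = False  # missing or malformed counts as a failure
    return results

print(validate_fields({"invoice_date": "2024-03-15", "total_amount": 149.90}))
# {'invoice_date': True, 'total_amount': True}
```

Checks like these can't prove a value is true, but they catch the impossible values (future invoice dates, negative totals) that schema validation waves through.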
The SOB also highlights the importance of human-in-the-loop verification for high-stakes tasks. Even the best models today aren’t fully trustworthy for unsupervised extraction. But with better benchmarks, we can identify which models—and which modalities—are safe to automate.
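One cheap gating heuristic, offered here as an assumption rather than anything SOB prescribes, is to run the extraction twice and escalate to a human whenever the two runs disagree on any field. `run_extraction` is again a hypothetical wrapper:

```python
def run_extraction(document: str) -> dict:
    """Hypothetical wrapper around your structured-extraction call."""
    raise NotImplementedError

def extract_with_review_flag(document: str) -> tuple[dict, bool]:
    """Run the extraction twice; flag for human review on any disagreement."""
    first, second = run_extraction(document), run_extraction(document)
    fields = first.keys() | second.keys()
    needs_review = any(first.get(f) != second.get(f) for f in fields)
    return first, needs_review
```

Agreement between runs is no guarantee of truth (a model can hallucinate the same value twice), but disagreement is a strong, inexpensive signal that a human should look.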
The Future of Reliable AI Workflows
As AI becomes embedded in mission-critical systems, the demand for deterministic outputs will only grow. The Structured Output Benchmark represents a crucial step toward that future—by measuring not just whether a model can generate JSON, but whether it should be trusted to do so.
The results are clear: structure alone is not enough. Value accuracy, cross-modal performance, and consistency matter just as much. And in this new era of AI reliability, open-source models are proving that sometimes, smaller—and smarter—is better.
The path forward isn’t about building bigger models. It’s about building better ones—models that don’t just sound right, but are right. Because in the world of automated workflows, a single hallucinated date can cost millions. And the only way to prevent that is to measure what truly matters: the truth behind the structure.
This article was curated from the Hacker News post "Show HN: A new benchmark for testing LLMs for deterministic outputs".