Table of Contents
- The Illusion of Accuracy in Structured Output
- Why Structured Hallucinations Are the New Silent Killer
- Benchmarking Across Modalities: Text, Image, and Audio
- The Open-Source Surprise: Why Smaller Models Are Winning
- The Challenge of Cross-Modal Consistency
- Toward Deterministic AI: The Path Forward
- The Future of Reliable AI Workflows
The Hidden Flaw in AI’s “Perfect” Answers: Why Structured Output Isn’t as Reliable as It Looks
Imagine you’re building an AI-powered system that processes thousands of invoices daily. The model returns clean JSON—correct schema, valid syntax, proper data types. Everything looks flawless. But weeks later, your finance team discovers that 37% of the invoice dates are off by months, quietly corrupting downstream accounting systems. The JSON was valid. The structure was perfect. But the values were hallucinated.
This isn’t a hypothetical. It’s a real, growing crisis in AI-driven automation. As large language models (LLMs) become the backbone of enterprise workflows—converting meeting transcripts into support tickets, extracting data from PDFs, or parsing audio for customer insights—the assumption that “valid JSON = correct data” is proving dangerously false. A new benchmark, the Structured Output Benchmark (SOB), is now shining a light on this silent epidemic of structured hallucinations—and the results are reshaping how we evaluate AI reliability.
The Illusion of Accuracy in Structured Output
For years, developers have relied on LLMs to generate structured data (JSON, XML, or database entries) because it is easier to integrate into software systems than raw text. But the standard tools for evaluating these outputs, such as JSONSchemaBench, only check whether the returned data matches the expected schema and types. They don't validate the truth of the values.
This creates a dangerous blind spot. A model might return `{"invoice_date": "2024-03-15"}` when the actual date in the source document was "2024-01-15". The JSON is valid. The type is correct (a string in ISO date format). But the value is wrong, and unless you have a ground-truth reference, you'll never know.
The SOB benchmark changes this by introducing value accuracy as a core metric. Every test case includes not just a JSON schema, but a human-verified ground-truth answer cross-checked against the original source—whether it’s a scanned invoice, a meeting recording, or a product manual. This allows the benchmark to detect not just schema violations, but subtle, plausible-sounding errors that slip through traditional validation.
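To make the blind spot concrete, here is a minimal sketch in Python using the `jsonschema` library. The schema and values are illustrative, echoing the invoice example above; they are not taken from SOB itself:

```python
from jsonschema import validate  # pip install jsonschema

# Schema for a single extracted field, mirroring the invoice example above.
schema = {
    "type": "object",
    "properties": {"invoice_date": {"type": "string", "format": "date"}},
    "required": ["invoice_date"],
}

extracted = {"invoice_date": "2024-03-15"}     # what the model returned
ground_truth = {"invoice_date": "2024-01-15"}  # what the source document says

# Schema-only evaluation: this passes without complaint.
validate(instance=extracted, schema=schema)

# Value accuracy is a separate check that schema validation never performs.
print(extracted["invoice_date"] == ground_truth["invoice_date"])  # False
```

The schema check and the truth check are independent, which is exactly why a pipeline built on schema validation alone can stay green while its data quietly rots.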
Why Structured Hallucinations Are the New Silent Killer
Hallucinations in LLMs are nothing new. But structured hallucinations are particularly insidious because they're harder to catch. Unlike a model claiming "the moon is made of cheese," a structured hallucination like `"targetmarketage": "25 to 35"` when the truth is `"15 to 35"` looks reasonable. It's type-correct, schema-compliant, and contextually plausible. It passes automated checks. But it's still wrong.
These errors compound in real-world systems. A marketing team using AI to analyze customer interviews might miss a key demographic segment. A logistics company automating shipment data extraction could misroute packages. The cost isn’t just in rework—it’s in lost trust, compliance risks, and flawed decision-making.
The SOB benchmark reveals that even top-tier models like GPT-5.4 and Claude-Sonnet-4.6 struggle with value accuracy, especially when moving beyond text. This isn’t just a model capability issue—it’s a fundamental challenge in aligning AI outputs with real-world truth.
Benchmarking Across Modalities: Text, Image, and Audio
One of the most groundbreaking aspects of the SOB is its multi-modal evaluation. Most benchmarks test LLMs on text alone, but real-world applications often involve images (scanned documents), audio (customer calls), or a mix. The SOB evaluates performance across all three, revealing surprising shifts in model rankings.
For example, GLM-4.7, an open-source model, ranks #2 overall—just behind GPT-5.4—but it leads in text-based structured output, outperforming even larger proprietary models. Meanwhile, Gemma-4-31B dominates in image processing, while Gemini-2.5-Flash takes the crown in audio. This modality-specific performance underscores a critical insight: no single model is best at everything.
- Gemma-4-31B leads image-based extraction with 88.7% accuracy.
- Gemini-2.5-Flash scores 91.5% on audio transcripts.
- GPT-5.4 ranks 3rd in text but drops to 9th in image tasks.
- Phi-4 (14B) outperforms GPT-5 and GPT-5-mini on text despite being smaller.
These results challenge the assumption that bigger models are always better. Qwen3.5-35B and GLM-4.7, both open-source, beat GPT-5 and Claude-Sonnet-4.6 on value accuracy—proving that architecture, training data, and fine-tuning matter more than sheer parameter count.
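As a sketch of how such a per-modality leaderboard can be computed, here is one way to aggregate field-level results in Python. The record format and the example numbers are assumptions for illustration, not SOB's actual harness:

```python
from collections import defaultdict

def per_modality_accuracy(results: list[dict]) -> dict:
    """Aggregate field-level value accuracy per (model, modality) pair.

    Each record is assumed to look like:
    {"model": "GLM-4.7", "modality": "text",
     "correct_fields": 18, "total_fields": 20}
    """
    totals = defaultdict(lambda: [0, 0])  # (model, modality) -> [correct, total]
    for r in results:
        key = (r["model"], r["modality"])
        totals[key][0] += r["correct_fields"]
        totals[key][1] += r["total_fields"]
    return {key: correct / total for key, (correct, total) in totals.items()}

# Illustrative records only; the scores below are not benchmark figures.
scores = per_modality_accuracy([
    {"model": "GLM-4.7", "modality": "text", "correct_fields": 18, "total_fields": 20},
    {"model": "GPT-5.4", "modality": "text", "correct_fields": 17, "total_fields": 20},
])
print(scores)  # {('GLM-4.7', 'text'): 0.9, ('GPT-5.4', 'text'): 0.85}
```

Scoring per modality rather than in aggregate is what lets a benchmark surface the ranking shifts described above.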
The Open-Source Surprise: Why Smaller Models Are Winning
One of the most unexpected findings from the SOB is the strong performance of open-source models. GLM-4.7, developed by Zhipu AI, doesn’t just compete with GPT-5.4—it often surpasses it in deterministic tasks. Similarly, Phi-4, a 14-billion-parameter model from Microsoft, outperforms both GPT-5 and GPT-5-mini on text-based structured output.
This shift reflects a broader trend: determinism favors precision over scale. Open-source models are often fine-tuned on high-quality, domain-specific datasets and optimized for reliability rather than creative fluency. They’re less likely to “improvise” values when the correct answer is clear.
This doesn’t mean proprietary models are obsolete. GPT-5.4 still leads in overall versatility and creative tasks. But for workflows requiring repeatable, accurate outputs, open-source models are emerging as the safer bet.
The Challenge of Cross-Modal Consistency
Another revelation from the SOB is the lack of cross-modal consistency in top models. A model that excels at parsing text might fail at extracting data from a scanned PDF or a noisy audio clip. This inconsistency is a major hurdle for real-world deployment, where data comes in multiple formats.
For instance, a customer support ticket might originate as a voice call (audio), be transcribed (text), and then attached as a PDF (image). A robust system needs a model—or a pipeline—that maintains accuracy across all stages. The SOB shows that few models achieve this today.
This underscores the need for modality-aware evaluation. The SOB’s approach—testing each model on text, image, and audio separately—provides a clearer picture of where strengths and weaknesses lie.
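One pragmatic response, sketched below, is to route each artifact to whichever model benchmarks best on its modality. The routing table echoes the rankings reported above; `extract_structured` is a hypothetical stand-in for your actual API calls, not a real library function:

```python
# Routing table echoing the per-modality leaders reported above.
BEST_MODEL_BY_MODALITY = {
    "text": "GLM-4.7",
    "image": "Gemma-4-31B",
    "audio": "Gemini-2.5-Flash",
}

def extract_structured(model: str, payload: bytes, schema: dict) -> dict:
    """Hypothetical wrapper around your actual model API call."""
    raise NotImplementedError

def route_extraction(artifact: dict, schema: dict) -> dict:
    """Dispatch an artifact ({"modality": ..., "data": ...}) to the
    model that benchmarks best on that input type."""
    model = BEST_MODEL_BY_MODALITY[artifact["modality"]]
    return extract_structured(model=model, payload=artifact["data"], schema=schema)
```

A pipeline like this trades the simplicity of a single model for the measured strengths of several, which is only possible once a benchmark scores modalities separately.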
Toward Deterministic AI: The Path Forward
The ultimate goal of the SOB isn’t just to rank models—it’s to push the entire field toward deterministic AI. In critical applications like finance, healthcare, and logistics, outputs must be not just structured, but correct. This requires a shift in how we train, evaluate, and deploy LLMs.
One promising direction is ground-truth anchoring, where models are trained not just to generate JSON, but to cite sources for each value. Another is field-level validation, where each extracted field is checked against a knowledge base or rule engine before being accepted.
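A minimal sketch of the field-level idea, assuming a hand-written rule table rather than a full rule engine (the field names and rules are illustrative):

```python
from datetime import date

# Illustrative rule table: one predicate per field, checked before a
# record is accepted into downstream systems.
RULES = {
    "invoice_date": lambda v: date.fromisoformat(v) <= date.today(),
    "total_amount": lambda v: isinstance(v, (int, float)) and v >= 0,
}

def validate_fields(record: dict) -> dict:
    """Return field -> bool; any False flags the record for review."""
    results = {}
    for field, check in RULES.items():
        try:
            results[field] = bool(check(record[field]))
        except (KeyError, ValueError, TypeError):
            results[field] = False  # missing or malformed counts as a failure
    return results

print(validate_fields({"invoice_date": "2024-03-15", "total_amount": 149.90}))
# {'invoice_date': True, 'total_amount': True}
```

Checks like these can't prove a value is true, but they catch the impossible values (future invoice dates, negative totals) that schema validation waves through.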
The SOB also highlights the importance of human-in-the-loop verification for high-stakes tasks. Even the best models today aren’t fully trustworthy for unsupervised extraction. But with better benchmarks, we can identify which models—and which modalities—are safe to automate.
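One cheap gating heuristic, offered here as an assumption rather than anything SOB prescribes, is to run the extraction twice and escalate to a human whenever the two runs disagree on any field. `run_extraction` is again a hypothetical wrapper:

```python
def run_extraction(document: str) -> dict:
    """Hypothetical wrapper around your structured-extraction call."""
    raise NotImplementedError

def extract_with_review_flag(document: str) -> tuple[dict, bool]:
    """Run the extraction twice; flag for human review on any disagreement."""
    first, second = run_extraction(document), run_extraction(document)
    fields = first.keys() | second.keys()
    needs_review = any(first.get(f) != second.get(f) for f in fields)
    return first, needs_review
```

Agreement between runs is no guarantee of truth (a model can hallucinate the same value twice), but disagreement is a strong, inexpensive signal that a human should look.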
The Future of Reliable AI Workflows
As AI becomes embedded in mission-critical systems, the demand for deterministic outputs will only grow. The Structured Output Benchmark represents a crucial step toward that future—by measuring not just whether a model can generate JSON, but whether it should be trusted to do so.
The results are clear: structure alone is not enough. Value accuracy, cross-modal performance, and consistency matter just as much. And in this new era of AI reliability, open-source models are proving that sometimes, smaller—and smarter—is better.
The path forward isn’t about building bigger models. It’s about building better ones—models that don’t just sound right, but are right. Because in the world of automated workflows, a single hallucinated date can cost millions. And the only way to prevent that is to measure what truly matters: the truth behind the structure.
This article was curated from the Hacker News post "Show HN: A new benchmark for testing LLMs for deterministic outputs".