Going Beyond the Basics - Taking a Deeper Dive into LLM Testing
Following on from our previous blog post summarising our review of the state of LLM testing in 2025, let's look more closely at the challenges facing anyone looking to build with these technologies.
If you're building production systems with LLMs, you've probably discovered that traditional testing approaches start to fall short. The fundamental challenge? LLMs exhibit non-deterministic behaviour, generating a diversity of responses even when presented with identical inputs. This isn't a flaw; it's part of what makes them powerful, but it demands entirely new testing methodologies.
Understanding the Challenge
So let's be honest about what we're dealing with. LLMs present several unique testing challenges that traditional software and data science testing simply wasn't designed for. Outputs depend heavily on context, the models exhibit emergent capabilities that weren't explicitly programmed, and, perhaps most importantly, they require significant computational resources to test comprehensively.
The relationship between testing costs and query volume creates a direct trade-off that organisations must carefully balance. You can't just run endless test suites like you might with traditional software; every evaluation has a real computational cost attached. In the pursuit of testing rigour, no-one wants to come in to work to find a $$$ cloud bill waiting!
But here's where it gets interesting: the field has evolved sophisticated approaches that work within these constraints rather than against them.
Evaluating What Actually Matters
Human evaluation remains the gold standard for assessing LLM quality, but it's resource-intensive and doesn't scale. Expert evaluators examine outputs across multiple dimensions: factual accuracy, logical coherence, task relevance, and linguistic quality. The challenge is knowing when human evaluation is worth the investment versus when automated approaches suffice.
Traditional metrics like BLEU and ROUGE continue to play important roles, but they have significant limitations. BLEU, for instance, measures n-gram overlap between generated and reference texts, which is useful for translation tasks, but it cannot account for synonyms or paraphrases and ignores semantic meaning beyond surface-level word matching.
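To make that concrete, here's a minimal sketch using NLTK's sentence_bleu (the sentences are illustrative, and smoothing is applied to avoid zero scores on short texts). A faithful paraphrase scores poorly simply because it shares few exact n-grams with the reference, while a near-copy with a different meaning scores well:

```python
# Minimal sketch of BLEU's surface-level matching; assumes `pip install nltk`.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat sat on the mat".split()
paraphrase = "a feline was resting on the rug".split()  # same meaning, few shared words
near_copy = "the cat sat on the hat".split()            # different meaning, heavy overlap

smooth = SmoothingFunction().method1  # avoids zero scores on short sentences

print("BLEU, faithful paraphrase:",
      round(sentence_bleu([reference], paraphrase, smoothing_function=smooth), 3))
print("BLEU, wrong but similar wording:",
      round(sentence_bleu([reference], near_copy, smoothing_function=smooth), 3))
```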
This is where semantic similarity approaches can really move the needle. BERTScore operates at the token level, computing semantic similarity through embedding alignment and deriving precision, recall, and F1 scores from these alignments. BLEURT takes a more comprehensive approach, employing a two-stage training process that first pretrains on millions of synthetic sentence pairs, then fine-tunes on human ratings to capture complex linguistic intricacies.
So more concretely, this allows BLEURT to recognize that a candidate response is semantically similar to a reference even when it contains more non-reference words than a competing response that BLEU would rank higher purely based on word overlap.
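Here's a hedged sketch of the same comparison using the open-source bert_score package (it downloads a pretrained model on first use, and the exact numbers depend on which model it loads). Reusing the sentences from the BLEU example above, a semantic metric can give the faithful paraphrase credit that n-gram overlap denies it:

```python
# Sketch only; assumes `pip install bert-score`. Scores depend on the underlying model.
from bert_score import score

references = ["the cat sat on the mat"] * 2
candidates = [
    "a feline was resting on the rug",  # faithful paraphrase, low BLEU
    "the cat sat on the hat",           # heavy n-gram overlap, different meaning
]

P, R, F1 = score(candidates, references, lang="en", verbose=False)
for candidate, f1 in zip(candidates, F1.tolist()):
    print(f"{candidate!r}: BERTScore F1 = {f1:.3f}")
```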
We have also seen the emergence of customizable evaluation frameworks like G-Eval, which represent another significant advancement. These systems leverage LLMs' instruction-following capabilities to enable evaluation across arbitrary criteria specified through natural language instructions. For example, G-Eval incorporates chain-of-thought reasoning, breaking down assessments into explicit steps before scoring, which provides transparency and improves reliability.
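The prompting details vary between implementations (the original G-Eval also weights scores by the judge model's token probabilities), but the core pattern is simple. Here's a hypothetical sketch in which call_llm is a stand-in for whichever LLM client you use:

```python
# Hypothetical sketch of a G-Eval-style evaluator; `call_llm` is a placeholder
# for your own LLM client, not a real library call.
from typing import Callable

EVAL_TEMPLATE = """You are evaluating a summary against the criterion: {criterion}.

Evaluation steps:
1. Read the source document and the summary carefully.
2. Check whether the summary satisfies the criterion, noting specific evidence.
3. Assign a score from 1 (poor) to 5 (excellent).

Source document:
{document}

Summary:
{summary}

Work through the steps above, then finish with the line: Score: <1-5>"""


def g_eval_style_score(call_llm: Callable[[str], str],
                       document: str, summary: str,
                       criterion: str = "coherence") -> int:
    """Run one chain-of-thought style evaluation and parse the final score."""
    prompt = EVAL_TEMPLATE.format(criterion=criterion, document=document, summary=summary)
    completion = call_llm(prompt)
    # Naive parsing; production code should validate and retry malformed outputs.
    return int(completion.rsplit("Score:", 1)[-1].strip()[0])
```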
So we've thought about our evaluation framework, but there's a critical piece that many teams miss: quantifying uncertainty. LLM evaluation requires understanding multiple sources of uncertainty, from the evaluation process itself to the inherent variability in model outputs. Point estimates like average accuracy fail to capture this variability. You need approaches that account for the probabilistic nature of these systems, using techniques like clustered standard errors when evaluation questions come in related groups.
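For example, if several benchmark questions are derived from the same source passage, treating them as independent will understate your uncertainty. Below is a minimal sketch, on illustrative data, of a cluster-robust standard error for mean accuracy using only NumPy:

```python
# Minimal sketch: cluster-robust standard error for a mean accuracy estimate.
# The data is illustrative: per-question correctness, grouped by source passage.
import numpy as np

correct = np.array([1, 1, 0, 1, 0, 0, 1, 1, 1, 0])  # 1 = model answered correctly
cluster = np.array([0, 0, 0, 1, 1, 2, 2, 3, 3, 3])  # questions sharing a passage

n = len(correct)
accuracy = correct.mean()
residuals = correct - accuracy

# Naive standard error assumes every question is independent.
se_naive = correct.std(ddof=1) / np.sqrt(n)

# The clustered version sums residuals within each cluster before squaring, so
# correlated questions don't spuriously shrink the error bar.
cluster_sums = np.array([residuals[cluster == g].sum() for g in np.unique(cluster)])
se_clustered = np.sqrt((cluster_sums ** 2).sum()) / n

print(f"accuracy = {accuracy:.2f}")
print(f"naive SE = {se_naive:.3f}, clustered SE = {se_clustered:.3f}")
```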
Lifting the Lid
While output evaluation tells you what your model produces, intrinsic analysis reveals fundamental characteristics and potential issues even before deployment. This becomes particularly valuable when you need to understand why a model behaves a certain way, rather than just observing what it outputs.
Text embedding analysis is really key if you're working with any system involving semantic similarity. Encoder models, despite being less prominent in recent discussions, are still among the most widely deployed models, seeing over a billion downloads per month on HuggingFace. The geometric properties of embedding spaces directly impact their effectiveness.
Word embeddings often exhibit consistent characteristics that can impact performance: they typically have non-zero mean vectors, show anisotropic distributions where energy concentrates in low-dimensional subspaces, and their dominant components often encode frequency-based rather than purely semantic information. These properties need to be properly understood to evaluate whether your embeddings make efficient use of the embedding space.
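You can run rough diagnostics for these properties on your own corpus. Here's a sketch using sentence-transformers and NumPy (the model name and sentences are just examples, and a real check would use a much larger, representative sample): it looks at the size of the mean vector, the average pairwise cosine similarity as an anisotropy proxy, and how much variance the top principal components absorb.

```python
# Rough embedding-space diagnostics; model name and sentences are examples only.
# Assumes `pip install sentence-transformers numpy`.
import numpy as np
from sentence_transformers import SentenceTransformer

sentences = [
    "The invoice is due at the end of the month.",
    "Payment is expected by month end.",
    "The cat sat on the mat.",
    "Quarterly revenue grew faster than expected.",
    "Our office moves to the new building in June.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # example encoder
emb = model.encode(sentences)                    # shape: (n_sentences, dim)

# 1. Non-zero mean: size of the common component relative to typical norms.
mean_vec = emb.mean(axis=0)
print("mean-vector norm / avg embedding norm:",
      round(float(np.linalg.norm(mean_vec) / np.linalg.norm(emb, axis=1).mean()), 3))

# 2. Anisotropy proxy: average pairwise cosine similarity (near 0 is isotropic).
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
cos = unit @ unit.T
print("average pairwise cosine similarity:",
      round(float(cos[np.triu_indices(len(sentences), k=1)].mean()), 3))

# 3. Energy concentration: variance explained by the top principal components.
_, s, _ = np.linalg.svd(emb - mean_vec, full_matrices=False)
var = s ** 2 / (s ** 2).sum()
print("variance in top 2 components:", round(float(var[:2].sum()), 3))
```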
Weight matrix analysis offers a fascinating approach to model diagnostics without requiring any test data. Modern LLMs contain thousands of weight matrices and tens of billions of parameters; GPT-3 alone has individual matrices with over a hundred million parameters each. These matrices exhibit complex internal structures that emerge during training.
The Heavy-Tailed Self-Regularization framework applies Random Matrix Theory to analyze these weight matrices, providing metrics that can assess layer quality, detect training anomalies, and even predict model performance. Tools like WeightWatcher make this analysis accessible, offering capabilities like layer quality assessment through alpha metrics, fine-tuning evaluation by examining layer alpha distributions, and training anomaly detection through correlation flow analysis.
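Here's a hedged sketch of what that looks like in practice with the open-source weightwatcher package on a small pretrained model (the model name is just an example, and the exact columns in the results can vary between versions):

```python
# Sketch only: data-free layer diagnostics with weightwatcher.
# Assumes `pip install weightwatcher transformers torch`; model name is an example.
import weightwatcher as ww
from transformers import AutoModel

model = AutoModel.from_pretrained("distilbert-base-uncased")  # example model

watcher = ww.WeightWatcher(model=model)
details = watcher.analyze()  # per-layer metrics as a pandas DataFrame

# Heavy-Tailed Self-Regularization heuristic: alpha roughly in the 2-6 range
# suggests well-trained layers; outlying values can flag problem layers.
print(details[["layer_id", "alpha"]].sort_values("alpha"))
print("mean alpha across layers:", details["alpha"].mean())
```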
Model compression presents another key area for intrinsic analysis. As organizations deploy LLMs into often computationally-constrained environments, techniques like pruning, quantization, knowledge distillation, and low-rank approximation become essential. But compression also introduces trade-offs. Whilst it reduces memory footprint and improves inference speed, it can lead to decreased accuracy and altered knowledge representation within the model. You don't want to squeeze your model into something that runs economically, only for it to start outputting nonsense.
The key metrics for evaluating compression impact go beyond simple performance measures to include model size, floating point operations, mean FLOPS utilization, speedup ratios, and compression ratios. Understanding these trade-offs requires systematic evaluation that accounts for both average-case and worst-case performance scenarios.
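As a small, hedged illustration of measuring two of those numbers, here's a sketch that applies PyTorch dynamic quantization to a toy model and reports the compression ratio and speedup. The figures it prints are hardware-dependent and illustrative only, and a real evaluation would also track task accuracy before and after compression:

```python
# Sketch: compression ratio and speedup from dynamic quantization of a toy model.
# Assumes a recent PyTorch; results are hardware-dependent and illustrative only.
import io
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 256)).eval()
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def size_bytes(m: nn.Module) -> int:
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes

def avg_latency(m: nn.Module, runs: int = 50) -> float:
    x = torch.randn(32, 1024)
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(runs):
            m(x)
    return (time.perf_counter() - start) / runs

print("compression ratio:", round(size_bytes(model) / size_bytes(quantized), 2))
print("speedup:", round(avg_latency(model) / avg_latency(quantized), 2))
```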
Building Systems That Last
How LLM testing has evolved also reflects a deeper understanding of these systems' unique characteristics. Rather than trying to force deterministic testing approaches onto probabilistic systems, effective testing frameworks embrace the uncertainty while providing systematic ways to manage and quantify it. Learn to love the uncertainty!
The computational intensity of LLM testing has led to tiered approaches where different levels of evaluation are applied at various stages of development and deployment. This isn't just about managing costs (far from it); it's about building testing strategies that can scale with your system's complexity while maintaining meaningful quality assurance.
What makes this particularly interesting is how intrinsic analysis complements behavioral testing. While you're evaluating outputs and user interactions, weight analysis and embedding diagnostics can reveal underlying issues that might not surface in behavioral tests until much later in deployment.
The evolution of LLM testing reflects something deeper: we're learning to work with these systems' unique characteristics rather than against them, embracing the uncertainty while keeping systematic ways to manage it.
Of course, there's computational intensity to manage and there are trade-offs to account for. But there's also incredible potential when you have the right testing foundations in place. Understanding these technical building blocks gives you the tools to make informed decisions and build LLM applications that genuinely deliver value in production.
Want to dive deeper into the details? Our comprehensive guide covers all this and more, download it below, and whilst you're there, grab our newsletter for regular insights from Team Etiq on building and maintaining robust AI systems.