Your Code Ran Fine. Your Model Was Still Wrong.
You've been debugging for two days. The pipeline runs. The tests pass. The accuracy metrics look solid. And yet, something in production is off. Predictions that made sense last month are drifting.
So you go back to the code. You read it line by line, looking for the bug. You don't find one, because there isn't one. The code is doing exactly what you told it to do. The problem is that what you told it to do and what you needed it to do aren't the same thing.
When the data carries assumptions you didn't make
In ML, your code can execute perfectly while the model learns the wrong thing, because the data itself carried assumptions nobody checked.
A January 2026 Towards Data Science article by practitioner Sudheer Singamsetty lays out a good example from fraud detection. A batch job computes chargeback counts at the end of each day. The resulting feature gets joined back to every transaction from that day, including ones that happened at 9am, hours before the chargeback was even reported. During training, the model sees this future information and learns to use it. Offline recall improves. Nothing looks wrong.
Then the model goes live, where it can only see what's available at the moment of prediction. The future signal vanishes. Performance collapses.
Everything in the pipeline worked as written. The join was correct and the feature calculation was accurate. But the timing assumption baked into the data meant the model had been cheating during training, and nobody noticed until production.
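The leaky join above can be sketched in a few lines of pandas. The table and column names here are hypothetical, but the shape of the bug is the one described: aggregating chargebacks per calendar day and joining on the date hands a 9am transaction information that only existed at 3pm, while a point-in-time lookup does not.

```python
import pandas as pd

# Hypothetical transactions: two from the same user on the same day,
# one in the morning and one in the evening.
tx = pd.DataFrame({
    "tx_id": [1, 2],
    "user_id": ["u1", "u1"],
    "ts": pd.to_datetime(["2024-03-01 09:00", "2024-03-01 21:00"]),
})

# Chargeback events, each with the time it was actually reported.
chargebacks = pd.DataFrame({
    "user_id": ["u1"],
    "reported_at": pd.to_datetime(["2024-03-01 15:00"]),
})

# Leaky version: aggregate per calendar day, then join on the date.
# The 9am transaction receives a count that includes a 3pm report.
daily = (
    chargebacks.assign(date=chargebacks["reported_at"].dt.date)
    .groupby(["user_id", "date"]).size()
    .rename("cb_count").reset_index()
)
leaky = tx.assign(date=tx["ts"].dt.date).merge(
    daily, on=["user_id", "date"], how="left"
)

# Point-in-time version: only count chargebacks reported strictly
# before each transaction's own timestamp.
def cb_count_as_of(row):
    mask = (chargebacks["user_id"] == row["user_id"]) & (
        chargebacks["reported_at"] < row["ts"]
    )
    return int(mask.sum())

tx["cb_count_pit"] = tx.apply(cb_count_as_of, axis=1)

print(leaky["cb_count"].tolist())   # [1, 1]: the 9am row sees the future
print(tx["cb_count_pit"].tolist())  # [0, 1]: the 9am row does not
```

The offline metrics improve precisely because the leaky column is more informative during training than anything the model will ever see live.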
Missing values that become signals
Singamsetty describes another pattern worth watching for. In a real-time fraud system, the feature avg_transaction_amount_last_7_days defaulted to zero for new or inactive users. The model picked up on this: zero transaction history correlated with low fraud, because new users hadn't had time to commit fraud yet. So the model learned that zero meant safe.
Then a downstream service started timing out during peak hours. Active users temporarily lost their transaction history. Their feature values dropped to zero. The model promptly classified them as low risk.
What makes this one frustrating is how reasonable every individual decision was. The default was a sensible engineering choice, the model learned a real statistical pattern from it, and the timeout was handled gracefully. The failure only existed in the relationship between these things, in the fact that nobody had separated "we don't have data" from "the value is zero."
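One common way to keep those two meanings apart is to carry an explicit missingness flag alongside the imputed value. This is a minimal sketch with hypothetical users and values, not the system described above: the point is that a timeout which wipes an active user's history flips a flag the model can see, instead of silently reading as "zero history, therefore safe".

```python
import pandas as pd

# Hypothetical feature lookup. NaN means "we don't have data" (the user
# is new, or a downstream service timed out). It does not mean zero.
raw = pd.DataFrame({
    "user_id": ["new_user", "active_user"],
    "avg_transaction_amount_last_7_days": [None, 37.5],
})

# Problematic default: collapses "unknown" into a real value the model
# has learned to associate with low risk.
bad = raw["avg_transaction_amount_last_7_days"].fillna(0.0)

# Safer encoding: an explicit missingness indicator next to the imputed
# value, so the model can distinguish the two situations.
raw["avg_7d_missing"] = (
    raw["avg_transaction_amount_last_7_days"].isna().astype(int)
)
raw["avg_7d_imputed"] = (
    raw["avg_transaction_amount_last_7_days"].fillna(0.0)
)

print(bad.tolist())                    # [0.0, 37.5]
print(raw["avg_7d_missing"].tolist())  # [1, 0]
```

The indicator column costs almost nothing, and it turns a silent failure mode into a feature the model (and your monitoring) can actually observe.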
Population shift that monitoring won't catch
There's a subtler version of this problem that can stay hidden for months. Most monitoring setups track feature distributions: if the shape of your input data changes, alarms fire. That works for straightforward data drift.
But what happens when the distributions stay stable and the population underneath changes? Singamsetty saw this in a payments risk system. As the product expanded into new geographies and user segments, transaction amounts and velocity patterns looked the same in aggregate. Dashboards showed green. The fraud rate, though, started creeping up in specific slices of traffic. The model was applying decision boundaries trained on mature users to a growing population of newer users whose similar-looking numbers carried different risk profiles.
This is the kind of failure that erodes trust slowly, because nobody can point to a single thing that broke. The usual reflex, retraining on more recent data, doesn't help much either if the monitoring can't tell you what changed in the first place.
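A toy illustration of why feature-level monitoring stays green, using invented numbers rather than anything from the payments system above: the transaction-amount distribution is identical before and after, and each segment's fraud rate is unchanged, yet the aggregate fraud rate rises purely because the population mix shifted toward the riskier segment. Only a slice-level view catches it.

```python
import pandas as pd

# Hypothetical monthly traffic: two user segments whose transaction
# amounts look identical, but whose fraud rates differ.
before = pd.DataFrame({
    "segment": ["mature"] * 100 + ["new"] * 20,
    "amount":  [50.0] * 120,
    "fraud":   [0] * 98 + [1] * 2 + [0] * 18 + [1] * 2,
})
after = pd.DataFrame({
    "segment": ["mature"] * 100 + ["new"] * 100,
    "amount":  [50.0] * 200,
    "fraud":   [0] * 98 + [1] * 2 + [0] * 90 + [1] * 10,
})

# Feature-level monitoring sees nothing: the amount distribution
# is the same in both periods.
assert before["amount"].mean() == after["amount"].mean()

# Per-segment fraud rates are also unchanged (mature 2%, new 10%)...
print(before.groupby("segment")["fraud"].mean())
print(after.groupby("segment")["fraud"].mean())

# ...but the aggregate rate climbs because the mix shifted toward the
# newer, riskier segment.
print(before["fraud"].mean())  # ~0.033
print(after["fraud"].mean())   # 0.06
```

Monitoring the segment mix itself, or key metrics sliced by segment, is what surfaces this kind of drift; distribution checks on individual features cannot.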
Why this matters beyond the technical team
When something breaks, you look at the code. When the code is fine, you assume the problem is elsewhere, maybe a business process issue, maybe noisy data that will settle down. The possibility that the model is systematically wrong while everything appears to work correctly doesn't always occur to people, especially stakeholders further from the pipeline.
A 2023 study published in Patterns (Kapoor & Narayanan) found that data leakage had affected at least 294 academic papers across 17 scientific fields, in some cases producing wildly overoptimistic conclusions. These were skilled practitioners whose code ran correctly while their data assumptions led them astray. If it happens routinely in peer-reviewed science, it's happening in production ML pipelines too.
Seeing the interplay, not just the code
Reading your code tells you what operations happen. It doesn't tell you whether the data flowing through those operations carries the assumptions you think it does, or whether the logical relationships between your data objects match your intent.
This is why we built Lineage as part of Etiq's Data Science Copilot. Lineage analyses your scripts and visualises the interplay between your data and your code, showing how data objects connect, transform, and flow through your pipeline. The relationships that are invisible when you're reading code top-to-bottom get laid out in front of you.
And once you can see the structure, Etiq's Testing Recommendations help you verify it, suggesting the right tests at the points in your pipeline where assumptions are most likely to go unexamined. You can run them with a click, directly in your IDE.
Your code might be fine. The question is whether your data is telling the truth.