How a Single Bug Was Costing Us Hundreds of Dollars a Month (And How We Found It)

A small logic bug was forcing our database to scan millions of rows over and over again—without any alerts firing. This is the story of how we found it, fixed it, and immediately cut our cloud costs in half.

Jan 6, 2026

Matt

Last month, we discovered that a single bug in our codebase was quietly draining our cloud budget. Queries that should have taken milliseconds were running for 12+ seconds, and they were firing dozens of times per minute. Our database CPU sat at a steady 50%, we were paying for 12 vCPUs we didn't need, and our monthly bill had ballooned by hundreds of dollars.

Here's how we found it, fixed it, and cut our database costs by more than half overnight.

The Symptoms

Something felt off. Our database instance was consistently running hot—CPU utilization hovering around 50% even during relatively quiet periods. We'd scaled up to 12 vCPUs to handle the load, added more application instances to manage the connection pool, and accepted that this was just the cost of doing business.

But it wasn't. It was the cost of a bug.

Finding the Culprit

We started digging with database insights tooling, which gave us visibility into query performance patterns. The data was concerning: certain queries were averaging 12+ seconds of execution time, and they were being called with alarming frequency—dozens of times per minute.

The tricky part was understanding why. The queries themselves looked reasonable at first glance. This is where we brought in Claude Opus 4.5 to help analyze the query patterns, examine the execution plans, and trace through the code paths that were generating these calls.

Within a couple of hours, we'd identified the root cause. It was subtle.

The Bug

We had a query searching a JSONB column using PostgreSQL's ?| operator, which checks if any of the provided array values exist in a JSONB array field:

SELECT * FROM profiles WHERE document -> 'lookups' ?| array['val1', 'val2']

We had a GIN index on this column specifically to make these lookups fast. And it worked great—when the array had values in it, queries returned in a few milliseconds.
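
For reference, an index that can serve this operator is an expression GIN index over the lookups array, using the default jsonb_ops operator class (jsonb_path_ops doesn't support ?|). Here's a sketch with an illustrative name rather than our actual schema:

-- Illustrative expression index; the default jsonb_ops opclass handles ?|
CREATE INDEX profiles_lookups_gin_idx
  ON profiles
  USING GIN ((document -> 'lookups'));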

The problem? Sometimes our application was passing an empty array:

SELECT * FROM profiles WHERE document -> 'lookups' ?| array[]::text[]

When the array was empty, PostgreSQL couldn't use the GIN index: with no values to probe the index for, the planner fell back to a sequential scan of the entire table. On a table with millions of rows, that scan took 8-12 seconds.

And due to an upstream logic issue, we were hitting this empty-array code path constantly—dozens of times per minute. Every single one triggered a full table scan.
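
You can see the difference directly in the query plans. Comparing the two forms with EXPLAIN ANALYZE (same illustrative table and column as above) should show a Bitmap Index Scan on the GIN index for the populated array and a Seq Scan for the empty one:

-- Populated array: the planner can probe the GIN index
EXPLAIN ANALYZE SELECT * FROM profiles WHERE document -> 'lookups' ?| array['val1', 'val2'];

-- Empty array: expect a Seq Scan over the whole table
EXPLAIN ANALYZE SELECT * FROM profiles WHERE document -> 'lookups' ?| array[]::text[];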

The Fix

The actual code change was small—a guard clause to skip the query entirely when the filter array was empty (since an empty array means "match nothing" anyway). We pushed to production and watched the dashboards.
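
Here's a minimal sketch of the idea, assuming a Node/TypeScript service talking to Postgres through node-postgres; the helper and its names are hypothetical stand-ins, not our actual code:

import { Pool } from "pg";

const pool = new Pool();

// Hypothetical helper: find profiles whose `lookups` array contains any of
// the given values.
async function findProfilesByLookups(lookups: string[]) {
  // Guard clause: an empty filter can't match anything, so return early
  // instead of handing Postgres an empty ?| array (which forces a table scan).
  if (lookups.length === 0) {
    return [];
  }

  const { rows } = await pool.query(
    "SELECT * FROM profiles WHERE document -> 'lookups' ?| $1::text[]",
    [lookups]
  );
  return rows;
}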

The results were immediate and dramatic.

The Numbers

Before the fix:

  • Average database CPU: ~50%

  • vCPUs provisioned: 12

  • Response time p99: spiking to 14+ seconds regularly

  • Extra application instances needed to handle DB connection pressure

After the fix:

  • Average database CPU: ~5%

  • vCPUs provisioned: 6 (and we could probably go lower)

  • Response time p99: stable, under a second

  • Monthly savings: hundreds of dollars

We literally watched the CPU graph cliff-dive from sustained 50% utilization down to near-idle within minutes of the deploy.

Lessons Learned

Monitor your query performance, not just your infrastructure metrics. High CPU on a database tells you something is working hard—but not what or why. Query-level insights would have caught this months earlier.
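
If you're on PostgreSQL and don't have dedicated tooling, the built-in pg_stat_statements extension is a reasonable starting point. A query like this (column names as of PostgreSQL 13+) surfaces the statements consuming the most total time, which is exactly the view that would have flagged our 12-second offender:

-- Top statements by total execution time (requires pg_stat_statements)
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;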

Edge cases can bypass your indexes entirely. We had the right index. Our queries used it—most of the time. But an empty array is a valid input that produces a valid (if useless) query, and PostgreSQL handled it the only way it could: by scanning every row. If you're using GIN indexes with array operators, make sure you're handling the empty-array case in your application code.

AI tooling is genuinely useful for this kind of detective work. Having Claude help analyze query patterns, reason through the code, and sanity-check our hypotheses accelerated the debugging process significantly. It's like pair programming with someone who doesn't get tired of reading execution plans.

Small bugs can have outsized costs. This wasn't a dramatic failure. Nothing crashed. No alerts fired. It was just... expensive. Quietly, consistently expensive. The kind of thing that becomes "normal" if you're not actively looking for it.

Fix the problem, then right-size your infrastructure. We'd scaled up to accommodate the bug. Once the bug was gone, we could scale back down. Don't let your infrastructure bills subsidize your technical debt.

Sometimes the most satisfying wins aren't building new features—they're finding the one line of code that's been silently costing you money and watching the graphs drop to where they should have been all along.

Try Rownd for free and accelerate user growth today

Streamline authentication, personalize user experiences, and make updates effortlessly—all without heavy development work. What are you waiting for?
