# Tracking 100M background jobs with `Trifle::Stats`

There is this funny phase in every background processing system where everything looks fine because nothing is screaming. Sidekiq is processing. Queues are moving. Errors are not exploding. CPU is not on fire. So naturally you assume it works.

And then you ask a simple question. How many jobs did we actually run today? Or more importantly, how many of them did what they were supposed to do? That is where things get interesting.

Sidekiq gives you queue stats. Error tracking gives you exceptions. Logs give you details if you already know what you are looking for. But none of these gives you a clean answer to "is this business process healthy?" At DropBot we needed that answer for Commodity calculations.

## The calculation pipeline

One of the busiest jobs in DropBot is `Commodity::ProductTargets::CalculateJob`. It calculates product target pricing. It looks at source data, product details, offers, classifications and reseller configuration, and then decides what the target should look like. This is the kind of workload where we track around 90M calculation jobs moving through Sidekiq.

Over time we also had product refreshing running at a large scale through `Commodity::Products::Details::RefreshJob` and `Commodity::Products::Classifications::RefreshJob`. That was another ~15M jobs/day for source refreshes.

Since then we moved part of that refresh workload into an API that supports batching. That reduced the number of individual Sidekiq jobs a lot. The work is still there, but its shape changed. Instead of looking only at job counts, we now also care about how many products one batched call touched, how long that batch took, how many products succeeded and how many failed.

That is exactly the kind of thing `Trifle::Stats` is good at. Not "did Ruby method X run 128ms slower than yesterday", but "is this whole process producing healthy results?"

## What we track

For calculation jobs we track the simple stuff first:

- total number of calculations
- success, failure, skipped, duplicate, deleted and inactive states
- result codes
- duration
- price, margin and estimated delivery day aggregates
- distribution buckets for values that are useful as histograms

The simplified version looks like this:

```ruby
Trifle::Stats.track(
  key: 'commodity::events::product_target::calculation',
  at: Time.zone.now,
  values: {
    count: 1,
    success: 1,
    duration: { count: 1, sum: duration, square: duration**2 },
    codes: { result.status.upcase => 1 },
    margin: margin_stats,
    price: price_stats,
    edd: edd_stats
  }
)
```

In the real code we write the same event to a few keys:

- global calculation stats
- per customer calculation stats
- per reseller calculation stats
- per product target history

That gives us multiple levels of visibility from the same event. If the global success rate drops, we can ask if it is all customers or one customer. If one reseller looks odd, we can drill further. If a single product keeps recalculating with weird values, we can inspect its timeline.

![Calculation dashboard](./2026-04-25-tracking-100m-background-jobs-with-trifle-stats/calculations.png)

The nice thing is that this data is tiny compared to the work itself. We are not storing every intermediate step inside Stats. We are storing counters, aggregates and distributions in time buckets. That is cheap enough to do often, and useful enough to keep around.
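To make the multi-key fan-out concrete, here is a minimal sketch of what writing one event under several keys can look like. The helper name and the per-customer, per-reseller and per-target key formats are made up for the example; only the `Trifle::Stats.track(key:, at:, values:)` call shape matches the snippet above.

```ruby
# A minimal sketch of the multi-key fan-out described above.
# The helper and the key formats are illustrative, not the actual DropBot code.
def track_calculation(values:, customer_id:, reseller_id:, product_target_id:)
  base = 'commodity::events::product_target::calculation'
  keys = [
    base,                                    # global calculation stats
    "#{base}::customer::#{customer_id}",     # per customer
    "#{base}::reseller::#{reseller_id}",     # per reseller
    "#{base}::target::#{product_target_id}"  # per product target history
  ]

  at = Time.zone.now
  keys.each { |key| Trifle::Stats.track(key: key, at: at, values: values) }
end
```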
## Result state is more useful than "failed"

One lesson I learned here is that failure is usually too broad. For calculations, "failed" is not the only interesting state. Some calculations are skipped because required source data is missing. Some are inactive and should not be calculated. Some product targets were deleted before the job ran. Some are duplicates. Some calculated correctly but returned an out-of-stock or unsupported result code.

These are all very different operational signals. If deleted jobs go up, maybe scheduling is too slow or stale jobs are sitting in the queue. If skipped jobs go up, maybe source refreshing is delayed. If a specific result code spikes, maybe a provider changed its payload shape or a calculator configuration no longer matches.

So we track them separately. Then the dashboard is not just a big green/red success chart. It becomes a map of what kind of work the system is doing.

## Product refreshes changed shape

Before batching, product refresh jobs were simple to count. One refresh job was one source request. We tracked success, failure, reschedule, duration, source, provider and related product target details.

When we moved details refreshes into a batched API, the interesting unit changed. One job can now refresh many products. So the tracking changed too:

```ruby
Trifle::Stats.track(
  key: 'commodity::events::source::details::all',
  at: Time.zone.now,
  values: {
    count: 1,
    success: job_success,
    failure: job_failure,
    duration: job_duration,
    products: {
      count: store_uids.count,
      success: product_success,
      failure: product_failure,
      duration: products_duration
    }
  }
)
```

This is a good example of why I like tracking at the application level. Infrastructure metrics would still tell you there was one job. Business metrics tell you this job refreshed 100 products, 97 succeeded, 3 failed, and the backlog was growing. Big difference.

![Source refresh dashboard](./2026-04-25-tracking-100m-background-jobs-with-trifle-stats/refreshes.png)

## What this gives us

This answers questions that are surprisingly hard to answer from logs alone:

- How many calculations did we run?
- What percentage returned a useful result?
- Which result codes are changing?
- Are failures global, customer-specific or reseller-specific?
- Did product refresh batching reduce Sidekiq jobs while keeping product throughput stable?
- Are refresh failures coming from a source, a provider or the batch itself?

That last point is important. When we moved product refreshes to batching, we expected job counts to drop. A lower job count was not a problem. It was the desired outcome. But without tracking `products.count` inside the batched job, a dashboard could make it look like less work was happening. With `Trifle::Stats`, we can track both. Jobs went down. Products processed stayed visible.

## A few lines in the right place

I don't think every method needs metrics. That's how dashboards become noise. The useful places are boundaries:

- A resource job finishes with a result
- An external source returns data
- A batch job processes many resources
- Business state changes in a way that matters later

Track these and you start seeing the shape of the system. You don't need perfect observability. You need enough signal to know when the system changed its behavior.

For us, `Trifle::Stats` became the small layer that turns product calculations and refreshes into understandable timelines. `Trifle::Traces` handles the deep dive when we need to inspect one execution.
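And when we just want to sanity-check the numbers from a console, reading them back is cheap. A minimal sketch, assuming the gem's standard `values` reader; the key and range are illustrative:

```ruby
# Sketch: read back one week of daily buckets for the calculation key.
# Each bucket carries the same counters and aggregates that were tracked,
# so a success rate is just success / count per bucket.
series = Trifle::Stats.values(
  key: 'commodity::events::product_target::calculation',
  from: 7.days.ago,
  to: Time.zone.now,
  range: :day
)
```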
Trifle App gives us dashboards and monitors so the data is not stuck in a console. And the whole thing started with a few `Trifle::Stats.track` calls.

That's still my favorite kind of tooling. Small code change. Big operational value.