# Tracking Cron and Schedule jobs with `Trifle::Stats`

Recurring jobs are easy to start and annoying to trust. You add a Sidekiq Cron entry. It runs every minute. It enqueues some work. Everything looks fine until you need to answer a very simple question: did it really run, did it touch the expected resources, and did it keep doing that for every customer?

At DropBot we use `Trifle::Stats` to answer that without writing a custom dashboard for every workflow.

## Cron should be fast

DropBot uses a common scheduling pattern across domains:

1. A global Cron job runs every minute or every few minutes.
2. It finds active tenants that need work.
3. It enqueues a local Schedule job for each tenant.
4. The Schedule job locks that tenant.
5. The Schedule job queues resource jobs for the things it looks after.

The Cron job should be fast. Its job is orchestration, not work.

In DropBot this tenant is a reseller, but the same pattern shows up in many SaaS systems. A tenant might be a customer, account, workspace, store, integration, merchant or project. The global job decides which tenants need attention. The tenant job decides what work belongs to that tenant. The resource jobs do the actual work.

For example, `Commodity::Cron::Resellers::ScheduleJob` finds enabled Commodity resellers and then enqueues `Commodity::Resellers::ScheduleJob` for each of them. The local Schedule job can then look at product targets, batches, synchronizations or whatever else belongs to that reseller.

This split sounds like extra ceremony, but it pays off. Cron stays small and predictable. Scheduling can take longer without blocking the global trigger. Each tenant has its own lock, so we avoid double-enqueueing and can keep the system moving tenant by tenant.
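How the global trigger gets scheduled is up to you. As a minimal sketch, assuming the `sidekiq-cron` gem (Sidekiq Enterprise periodic jobs or any other scheduler that enqueues the job on a cadence works the same way), the entry could look like this:

```ruby
# config/initializers/sidekiq_cron.rb
# A sketch, assuming the sidekiq-cron gem; the cadence is illustrative.
Sidekiq::Cron::Job.create(
  name: 'commodity-resellers-schedule',
  cron: '*/5 * * * *', # every five minutes
  class: 'Commodity::Cron::Resellers::ScheduleJob'
)
```

Nothing in the tracking below depends on the scheduler, only on the job classes it triggers.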
## Track the base classes

The important part is that most of this tracking is not written in every single job. We have base classes for that.

The Cron base class wraps the global trigger. It measures how often it runs, how long it takes and how many tenants it schedules. Here is a simplified version:

```ruby
class ApplicationCronJob
  include Sidekiq::Job

  attr_reader :duration, :success, :failure, :limited, :resources_count

  def perform
    @resources_count = 0
    @duration = Benchmark.realtime do
      limiter.within_limit { cron }
    end
    @success = 1
  rescue Sidekiq::Limiter::OverLimit
    @limited = 1
  rescue StandardError
    @failure = 1
    raise
  ensure
    # Always record the run, whether it succeeded, failed or was limited.
    track
  end

  private

  def cron
    raise NotImplementedError
  end

  # Stamp the tenant's scheduled flag and enqueue its Schedule job together.
  def enqueue_for_tenant(tenant, flag:, job_class:, args: [])
    Tenant.transaction do
      Tenant.find(tenant.id).update!(flag => Time.zone.now)
      job_class.perform_async(tenant.id, *args)
    end
    incr_counter
  end

  def incr_counter(by: 1)
    @resources_count += by
  end

  # At most one concurrent run per cron class.
  def limiter
    Sidekiq::Limiter.concurrent(
      "cron_#{sanitized_name}",
      1,
      wait_timeout: 10,
      lock_timeout: 3600
    )
  end

  def sanitized_name
    self.class.name.underscore.gsub('/', '__')
  end

  def track_values
    {
      duration: duration.to_f.ceil,
      count: 1,
      resources: resources_count.to_i,
      state: {
        success: success.to_i,
        failure: failure.to_i,
        limited: limited.to_i
      }
    }
  end

  def track
    Trifle::Stats.track(
      key: 'event::cron',
      at: Time.zone.now,
      values: {
        total: track_values,
        jobs: { sanitized_name => track_values }
      }
    )
  end
end
```

A concrete Cron job then only needs to define what "cron" means:

```ruby
class Billing::Cron::ScheduleInvoicesJob < ApplicationCronJob
  private

  def cron
    tenants.find_each do |tenant|
      enqueue_for_tenant(
        tenant,
        flag: :billing_scheduled_at,
        job_class: Billing::Tenants::ScheduleInvoicesJob
      )
    end
  end

  def tenants
    Tenant.active.billing_enabled.where(billing_scheduled_at: nil)
  end
end
```

This gives every Cron job the same tracking shape:

- invocation count
- duration
- tenants scheduled
- success/failure state
- per-job breakdown
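To make that shape concrete: a run of `Billing::Cron::ScheduleInvoicesJob` that scheduled 42 tenants in 2.7 seconds would write a payload like this (the numbers are invented for illustration):

```ruby
# Illustrative payload for one run, written to the 'event::cron' key.
# The shape matches track_values above; the numbers are made up.
{
  total: {
    duration: 3, count: 1, resources: 42,
    state: { success: 1, failure: 0, limited: 0 }
  },
  jobs: {
    'billing__cron__schedule_invoices_job' => {
      duration: 3, count: 1, resources: 42,
      state: { success: 1, failure: 0, limited: 0 }
    }
  }
}
```

`Trifle::Stats.track` increments rather than overwrites, so the next run adds to these numbers in every configured time range.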
The Schedule base class wraps the local tenant scheduler. This is where we lock the tenant, count the resources it found and write both global and per-tenant metrics:

```ruby
class ApplicationScheduleJob
  include Sidekiq::Job

  attr_reader :duration, :success, :failure, :limited, :resources_count

  def perform(tenant_id)
    @tenant_id = tenant_id
    @resources_count = 0
    @duration = Benchmark.realtime do
      limiter.within_limit { schedule }
    end
    release_flag
    @success = 1
  rescue Sidekiq::Limiter::OverLimit
    # No release here: a concurrent run already holds the lock
    # and will clear the flag when it finishes.
    @limited = 1
  rescue StandardError
    release_flag
    @failure = 1
    raise
  ensure
    track
  end

  private

  def schedule
    raise NotImplementedError
  end

  def release_flag
    tenant.update!(tenant_scheduled_flag => nil)
  end

  def tenant_scheduled_flag
    raise NotImplementedError
  end

  def tenant
    @tenant ||= Tenant.find(@tenant_id)
  end

  def incr_counter(by: 1)
    @resources_count += by
  end

  # At most one concurrent run per tenant and job class.
  def limiter
    Sidekiq::Limiter.concurrent(
      "schedule_#{tenant_key}_#{sanitized_name}",
      1,
      wait_timeout: 10,
      lock_timeout: 3600
    )
  end

  def sanitized_name
    self.class.name.underscore.gsub('/', '__')
  end

  def tenant_key
    tenant.slug.tr(':', '_')
  end

  def track_values
    {
      duration: duration.to_f.ceil,
      count: 1,
      resources: resources_count.to_i,
      state: {
        success: success.to_i,
        failure: failure.to_i,
        limited: limited.to_i
      }
    }
  end

  def track
    ['event::schedule', "event::schedule::#{tenant_key}"].each do |key|
      Trifle::Stats.track(
        key: key,
        at: Time.zone.now,
        values: {
          total: track_values,
          jobs: { sanitized_name => track_values },
          tenants: { tenant_key => track_values }
        }
      )
    end
  end
end
```

And a concrete Schedule job can stay focused on its domain:

```ruby
class Billing::Tenants::ScheduleInvoicesJob < ApplicationScheduleJob
  private

  def schedule
    tenant.invoices.ready_to_sync.find_in_batches(batch_size: 1_000) do |invoices|
      invoices.each do |invoice|
        Billing::SyncInvoiceJob.perform_async(tenant.id, invoice.id)
      end
      incr_counter(by: invoices.size)
    end
  end

  def tenant_scheduled_flag
    :billing_scheduled_at
  end
end
```

This gives every Schedule job the same shape globally and per tenant:

- how many times a scheduler ran
- how many resources it touched
- how long it took
- whether it succeeded, failed or hit a concurrency limit
- which tenant it belonged to

And that is the point. You don't need a complex integration to get useful operational visibility. You need a consistent payload and a place to aggregate it.

The base classes write this to keys like `event::cron`, `event::schedule` and `event::schedule::<tenant_key>`. The nested values matter:

- `total` shows the aggregate shape of the workflow.
- `jobs` breaks the same numbers down by job class.
- `tenants` breaks Schedule jobs down by tenant.
- `state` stays inside `values`, so success, failure and limited counts can be aggregated like any other metric.

DropBot calls the tenant breakdown `resellers`, because that is the tenant model in that application. The structure is the same. Once the base class emits it, every Cron and Schedule job gets dashboards almost for free.

The Cron dashboard shows whether the global triggers are running, how often they run and how many resources they schedule:

![Cron dashboard](./2026-05-02-tracking-cron-schedule-jobs/cron.png)

The same pattern also works for periodic jobs that do not belong to a tenant. In that case we still use the same Cron and Schedule base classes, but the tenant identifier becomes `system`. There is no reseller, account or workspace record to lock and release, so we don't use record-based scheduled flags there. We rely on `Sidekiq::Limiter` for concurrency and track the run under the same metric shape. That lets us visualize system jobs next to tenant jobs, including invocation count, duration, resources touched and success or failure rate.
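A system job can reuse `ApplicationScheduleJob` by overriding the tenant-specific hooks. Here is a sketch with made-up names, not DropBot's actual code:

```ruby
# A sketch of a tenant-less system job; the class name and the
# Session model with its `expired` scope are illustrative.
class System::Schedule::CleanupExpiredSessionsJob < ApplicationScheduleJob
  def perform
    super(nil) # no tenant record to load or lock
  end

  private

  def schedule
    expired = Session.expired.delete_all
    incr_counter(by: expired)
  end

  # There is no record-based flag to set or clear.
  def release_flag; end

  # Everything tracks under the `system` identifier.
  def tenant_key
    'system'
  end
end
```

The run still goes through the limiter and `track`, so it shows up under `event::schedule::system` right next to the tenant schedulers.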
The Schedule dashboard shows the local execution side, including per-job and per-tenant breakdowns. System jobs appear with `system` as the identifier:

![Schedule dashboard](./2026-05-02-tracking-cron-schedule-jobs/schedule.png)

## What this makes visible

This answers questions that are surprisingly hard to answer from logs alone:

- Did this Cron job run as often as expected?
- Did it find any tenants?
- Did a scheduler run but touch zero resources?
- Did one tenant suddenly enqueue far more work than usual?
- Are failures isolated to one scheduler or global?
- Is duration growing because there is more work or because something got slower?

This is especially useful for quiet failures. Some failures are loud. They raise exceptions and error tracking catches them. The annoying ones are quiet. Cron not running often enough. Schedule jobs running but not finding resources. A tenant stuck behind its own lock. A source slowly producing fewer resources than expected. These are time-series problems, not exception problems.

## Trifle App turns it into monitoring

Dashboards are the first step. Open Trifle App, pick the metric key, add a few widgets and now you can see what the system is doing. But dashboards still require someone to look. The next step is monitors.

Once the metrics are in Trifle App, you can create alerts for things like:

- Cron invocation count below expected range
- Schedule failures above a threshold
- scheduler resources being zero for too long
- duration drifting upward
- limited state increasing
- one tenant producing unusual volume

This is where the combination becomes useful. `Trifle::Stats` keeps the instrumentation small. Trifle App gives you dashboards, alerts and scheduled reports on top of the same data.

## A few lines in the right place

The best place to add tracking is usually at the boundary:

- Cron starts a workflow
- Schedule decides what should be queued
- Resource jobs report the result
- Batch jobs report how many resources they touched

DropBot's Cron and Schedule tracking is intentionally simple. Count the invocation. Count the resources. Track duration. Track state. Break it down by job and tenant.

That small payload gives enough signal to know if the system changed its behavior. And when it does, Trifle App can make noise before a customer does.

Ready to get this kind of visibility into your own recurring jobs? [Get started with Trifle](https://app.trifle.io) or [explore the documentation](https://docs.trifle.io).