# Tracking Cron and Schedule jobs with `Trifle::Stats`

Recurring jobs are easy to start and annoying to trust. You add a Sidekiq Cron entry. It runs every minute. It enqueues some work. Everything looks fine until you need to answer a very simple question: did it really run, did it touch the expected resources, and did it keep doing that for every customer?

At DropBot we use `Trifle::Stats` to answer that without writing a custom dashboard for every workflow.

## Cron should be fast

DropBot uses a common scheduling pattern across domains:

1. A global Cron job runs every minute or every few minutes.
2. It finds active tenants that need work.
3. It enqueues a local Schedule job for each tenant.
4. The Schedule job locks that tenant.
5. The Schedule job queues resource jobs for the things it looks after.

The Cron job should be fast. Its job is orchestration, not work.

In DropBot this tenant is a reseller, but the same pattern shows up in many SaaS systems. A tenant might be a customer, account, workspace, store, integration, merchant or project. The global job decides which tenants need attention. The tenant job decides what work belongs to that tenant. The resource jobs do the actual work.

For example, `Commodity::Cron::Resellers::ScheduleJob` finds enabled Commodity resellers and then enqueues `Commodity::Resellers::ScheduleJob` for each of them. The local Schedule job can then look at product targets, batches, synchronizations or whatever else belongs to that reseller.

This split sounds like extra ceremony, but it pays off. Cron stays small and predictable. Scheduling can take longer without blocking the global trigger. Each tenant has its own lock, so we avoid double-enqueueing and can keep the system moving tenant by tenant.
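How the global trigger gets scheduled is up to you. As a minimal sketch, assuming the `sidekiq-cron` gem (Sidekiq Enterprise periodic jobs or any other scheduler that enqueues the job on a cadence works the same way), the entry could look like this:

```ruby
# config/initializers/sidekiq_cron.rb
# A sketch, assuming the sidekiq-cron gem; the cadence is illustrative.
Sidekiq::Cron::Job.create(
  name: 'commodity-resellers-schedule',
  cron: '*/5 * * * *', # every five minutes
  class: 'Commodity::Cron::Resellers::ScheduleJob'
)
```

Nothing in the tracking below depends on the scheduler, only on the job classes it triggers.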
## Track the base classes

The important part is that most of this tracking is not written in every single job. We have base classes for that.

The Cron base class wraps the global trigger. It measures how often it runs, how long it takes and how many tenants it schedules. Here is a simplified version:

```ruby
class ApplicationCronJob
  include Sidekiq::Job

  attr_reader :duration, :success, :failure, :limited, :resources_count

  def perform
    @resources_count = 0
    @duration = Benchmark.realtime do
      limiter.within_limit { cron }
    end
    @success = 1
  rescue Sidekiq::Limiter::OverLimit
    @limited = 1
  rescue StandardError
    @failure = 1
    raise
  ensure
    # Always record the run, whether it succeeded, failed or was limited.
    track
  end

  private

  def cron
    raise NotImplementedError
  end

  # Stamp the tenant's scheduled flag and enqueue its Schedule job together.
  def enqueue_for_tenant(tenant, flag:, job_class:, args: [])
    Tenant.transaction do
      Tenant.find(tenant.id).update!(flag => Time.zone.now)
      job_class.perform_async(tenant.id, *args)
    end
    incr_counter
  end

  def incr_counter(by: 1)
    @resources_count += by
  end

  # At most one concurrent run per cron class.
  def limiter
    Sidekiq::Limiter.concurrent(
      "cron_#{sanitized_name}",
      1,
      wait_timeout: 10,
      lock_timeout: 3600
    )
  end

  def sanitized_name
    self.class.name.underscore.gsub('/', '__')
  end

  def track_values
    {
      duration: duration.to_f.ceil,
      count: 1,
      resources: resources_count.to_i,
      state: {
        success: success.to_i,
        failure: failure.to_i,
        limited: limited.to_i
      }
    }
  end

  def track
    Trifle::Stats.track(
      key: 'event::cron',
      at: Time.zone.now,
      values: {
        total: track_values,
        jobs: { sanitized_name => track_values }
      }
    )
  end
end
```

A concrete Cron job then only needs to define what "cron" means:

```ruby
class Billing::Cron::ScheduleInvoicesJob < ApplicationCronJob
  private

  def cron
    tenants.find_each do |tenant|
      enqueue_for_tenant(
        tenant,
        flag: :billing_scheduled_at,
        job_class: Billing::Tenants::ScheduleInvoicesJob
      )
    end
  end

  def tenants
    Tenant.active.billing_enabled.where(billing_scheduled_at: nil)
  end
end
```

This gives every Cron job the same tracking shape:

- invocation count
- duration
- tenants scheduled
- success/failure state
- per-job breakdown
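To make that shape concrete: a run of `Billing::Cron::ScheduleInvoicesJob` that scheduled 42 tenants in 2.7 seconds would write a payload like this (the numbers are invented for illustration):

```ruby
# Illustrative payload for one run, written to the 'event::cron' key.
# The shape matches track_values above; the numbers are made up.
{
  total: {
    duration: 3, count: 1, resources: 42,
    state: { success: 1, failure: 0, limited: 0 }
  },
  jobs: {
    'billing__cron__schedule_invoices_job' => {
      duration: 3, count: 1, resources: 42,
      state: { success: 1, failure: 0, limited: 0 }
    }
  }
}
```

`Trifle::Stats.track` increments rather than overwrites, so the next run adds to these numbers in every configured time range.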
The Schedule base class wraps the local tenant scheduler. This is where we lock the tenant, count the resources it found and write both global and per-tenant metrics:

```ruby
class ApplicationScheduleJob
  include Sidekiq::Job

  attr_reader :duration, :success, :failure, :limited, :resources_count

  def perform(tenant_id)
    @tenant_id = tenant_id
    @resources_count = 0
    @duration = Benchmark.realtime do
      limiter.within_limit { schedule }
    end
    release_flag
    @success = 1
  rescue Sidekiq::Limiter::OverLimit
    # No release here: a concurrent run already holds the lock
    # and will clear the flag when it finishes.
    @limited = 1
  rescue StandardError
    release_flag
    @failure = 1
    raise
  ensure
    track
  end

  private

  def schedule
    raise NotImplementedError
  end

  def release_flag
    tenant.update!(tenant_scheduled_flag => nil)
  end

  def tenant_scheduled_flag
    raise NotImplementedError
  end

  def tenant
    @tenant ||= Tenant.find(@tenant_id)
  end

  def incr_counter(by: 1)
    @resources_count += by
  end

  # At most one concurrent run per tenant and job class.
  def limiter
    Sidekiq::Limiter.concurrent(
      "schedule_#{tenant_key}_#{sanitized_name}",
      1,
      wait_timeout: 10,
      lock_timeout: 3600
    )
  end

  def sanitized_name
    self.class.name.underscore.gsub('/', '__')
  end

  def tenant_key
    tenant.slug.tr(':', '_')
  end

  def track_values
    {
      duration: duration.to_f.ceil,
      count: 1,
      resources: resources_count.to_i,
      state: {
        success: success.to_i,
        failure: failure.to_i,
        limited: limited.to_i
      }
    }
  end

  def track
    ['event::schedule', "event::schedule::#{tenant_key}"].each do |key|
      Trifle::Stats.track(
        key: key,
        at: Time.zone.now,
        values: {
          total: track_values,
          jobs: { sanitized_name => track_values },
          tenants: { tenant_key => track_values }
        }
      )
    end
  end
end
```

And a concrete Schedule job can stay focused on its domain:

```ruby
class Billing::Tenants::ScheduleInvoicesJob < ApplicationScheduleJob
  private

  def schedule
    tenant.invoices.ready_to_sync.find_in_batches(batch_size: 1_000) do |invoices|
      invoices.each do |invoice|
        Billing::SyncInvoiceJob.perform_async(tenant.id, invoice.id)
      end
      incr_counter(by: invoices.size)
    end
  end

  def tenant_scheduled_flag
    :billing_scheduled_at
  end
end
```

This gives every Schedule job the same shape globally and per tenant:

- how many times a scheduler ran
- how many resources it touched
- how long it took
- whether it succeeded, failed or hit a concurrency limit
- which tenant it belonged to

And that is the point. You don't need a complex integration to get useful operational visibility. You need a consistent payload and a place to aggregate it.

The base classes write this to keys like `event::cron`, `event::schedule` and `event::schedule::<tenant_key>`. The nested values matter:

- `total` shows the aggregate shape of the workflow.
- `jobs` breaks the same numbers down by job class.
- `tenants` breaks Schedule jobs down by tenant.
- `state` stays inside `values`, so success, failure and limited counts can be aggregated like any other metric.

DropBot calls the tenant breakdown `resellers`, because that is the tenant model in that application. The structure is the same. Once the base class emits it, every Cron and Schedule job gets dashboards almost for free.

The Cron dashboard shows whether the global triggers are running, how often they run and how many resources they schedule:

![Cron dashboard](./2026-05-02-tracking-cron-schedule-jobs/cron.png)

The same pattern also works for periodic jobs that do not belong to a tenant. In that case we still use the same Cron and Schedule base classes, but the tenant identifier becomes `system`. There is no reseller, account or workspace record to lock and release, so we don't use record-based scheduled flags there. We rely on `Sidekiq::Limiter` for concurrency and track the run under the same metric shape. That lets us visualize system jobs next to tenant jobs, including invocation count, duration, resources touched and success or failure rate.
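A system job can reuse `ApplicationScheduleJob` by overriding the tenant-specific hooks. Here is a sketch with made-up names, not DropBot's actual code:

```ruby
# A sketch of a tenant-less system job; the class name and the
# Session model with its `expired` scope are illustrative.
class System::Schedule::CleanupExpiredSessionsJob < ApplicationScheduleJob
  def perform
    super(nil) # no tenant record to load or lock
  end

  private

  def schedule
    expired = Session.expired.delete_all
    incr_counter(by: expired)
  end

  # There is no record-based flag to set or clear.
  def release_flag; end

  # Everything tracks under the `system` identifier.
  def tenant_key
    'system'
  end
end
```

The run still goes through the limiter and `track`, so it shows up under `event::schedule::system` right next to the tenant schedulers.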
The Schedule dashboard shows the local execution side, including per-job and per-tenant breakdowns. System jobs appear with `system` as the identifier:

![Schedule dashboard](./2026-05-02-tracking-cron-schedule-jobs/schedule.png)

## What this makes visible

This answers questions that are surprisingly hard to answer from logs alone:

- Did this Cron job run as often as expected?
- Did it find any tenants?
- Did a scheduler run but touch zero resources?
- Did one tenant suddenly enqueue far more work than usual?
- Are failures isolated to one scheduler or global?
- Is duration growing because there is more work or because something got slower?

This is especially useful for quiet failures. Some failures are loud. They raise exceptions and error tracking catches them. The annoying ones are quiet. Cron not running often enough. Schedule jobs running but not finding resources. A tenant stuck behind its own lock. A source slowly producing fewer resources than expected. These are time-series problems, not exception problems.

## Trifle App turns it into monitoring

Dashboards are the first step. Open Trifle App, pick the metric key, add a few widgets and now you can see what the system is doing. But dashboards still require someone to look. The next step is monitors.

Once the metrics are in Trifle App, you can create alerts for things like:

- Cron invocation count below expected range
- Schedule failures above a threshold
- scheduler resources being zero for too long
- duration drifting upward
- limited state increasing
- one tenant producing unusual volume

This is where the combination becomes useful. `Trifle::Stats` keeps the instrumentation small. Trifle App gives you dashboards, alerts and scheduled reports on top of the same data.

## A few lines in the right place

The best place to add tracking is usually at the boundary:

- Cron starts a workflow
- Schedule decides what should be queued
- Resource jobs report the result
- Batch jobs report how many resources they touched

DropBot's Cron and Schedule tracking is intentionally simple. Count the invocation. Count the resources. Track duration. Track state. Break it down by job and tenant.

That small payload gives enough signal to know if the system changed its behavior. And when it does, Trifle App can make noise before a customer does.

Ready to get this kind of visibility into your own recurring jobs? [Get started with Trifle](https://app.trifle.io) or [explore the documentation](https://docs.trifle.io).