Scheduled workflow delays
Affected components: Notification delivery
Updates

Write-up published

Knock Scheduled Workflow Delay Incident Postmortem
Summary

Knock experienced degraded performance when executing workflows with delay and batch steps, starting on August 5, 2024 at around 10:00 a.m. EDT and lasting until 6:11 p.m. EDT that same day. During this time, some customers experienced workflow runs with delay or batch steps becoming stuck in a paused/processing state beyond the configured delay times.

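For context on the mechanism described in the timeline below: delay and batch steps pause a workflow run, and a scheduled job in Knock’s database-backed job queue (Oban) later resumes it. The following is a minimal sketch of that general pattern using Oban’s documented scheduling option; the worker module and resume function are hypothetical placeholders, not Knock’s actual implementation.

```elixir
defmodule MyApp.ResumeWorkflowWorker do
  # Hypothetical worker: when the scheduled job executes, it resumes the paused run.
  use Oban.Worker, queue: :workflows

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"workflow_run_id" => run_id}}) do
    # Placeholder for the real resume logic.
    MyApp.Workflows.resume_run(run_id)
  end
end

# When a run reaches a delay step, enqueue a job that becomes available once the
# delay elapses. `schedule_in` is a standard Oban option (seconds until available).
%{"workflow_run_id" => "run_123"}
|> MyApp.ResumeWorkflowWorker.new(schedule_in: 60 * 60)
|> Oban.insert()
```
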
Timeline
  • August 5, 2024, ~ 10:00 a.m. EDT — The database-backed job queue Knock uses to execute workflow jobs begins to intermittently fail to stage scheduled jobs for execution. Knock uses scheduled jobs to resume execution for jobs paused on a delay or batch step. When Knock fails to stage one of these scheduled jobs, the workflow run becomes stuck indefinitely in a paused/processing state.

  • 11:59 a.m. EDT — Knock receives the first customer report of workflow runs with delay steps becoming stuck in a processing state.

  • 12:07 p.m. EDT — Knock Support acknowledges the report and begins investigating.

  • 12:22 p.m. EDT — Knock receives a second customer report of stuck workflow runs.

  • 12:32 p.m. EDT — Knock Support confirms the issue on a couple of customer accounts and is also able to reproduce the intermittent problem. The issue is triaged to Engineering and Knock opens an internal incident.

  • 1:11 p.m. EDT — At Knock, we use eventually consistent logging to generate the workflow run log views available in the Dashboard. Knock engineers spend about 30 minutes confirming that this is not purely a logging issue. Engineering then identifies the root issue: some scheduled jobs are not being picked up for resumed execution.

  • 1:21 p.m. EDT — Knock Status page is updated to indicate degraded performance affecting delayed workflow runs.

  • 2:15 p.m. EDT — Knock engineers deploy a script that identifies scheduled jobs past their resume time and manually restages them for execution (a sketch of this kind of restage pass appears after the timeline). This fix ensures all workflow runs resume no more than one minute past their due time. As part of this script, we are also able to stage and resume all the jobs stuck in a processing state up until this point. We continue investigating while monitoring this temporary, partial improvement.

  • 3:59 p.m. EDT — Knock engineers identify that the database queries used to identify and stage scheduled workflow jobs for resumption are intermittently timing out, which is why some delay and batch steps never resume. These timeouts were not immediately visible because they weren’t reported to Sentry, the exception capture tool we use at Knock. Instead, these errors were captured and silently logged within the third-party library (Oban) we use to manage this job queue (see the telemetry sketch after the timeline).

  • 5:22 p.m. EDT — Knock deploys an update to the Oban library that contains efficiency improvements for the queries used to restage scheduled jobs. Unfortunately, this has no impact on the issue.

  • 5:50 p.m. EDT — Knock engineers identify that the database table used to store jobs has accumulated over 20 million stale records for completed jobs in the past day. We believe that the sudden growth in size of the table is a likely factor in degraded restage query performance. Knock begins aggressively pruning these stale records.

  • 6:10 p.m. EDT — Knock finishes pruning the 20M+ stale job records from the database. Knock observes immediate resolution of the issue impacting customers. Knock is able to turn off the force-restage job script without falling behind on scheduled work.

  • 6:11 p.m. EDT — The incident is marked as resolved on the Knock status page.
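
For illustration, the restage pass described at 2:15 p.m. could look roughly like the following, assuming Oban’s standard `oban_jobs` table and job states (this mirrors what Oban’s own stager does periodically). It is a sketch with a placeholder repo module, not the script Knock ran.

```elixir
# Find scheduled jobs whose resume time has already passed but which were never
# picked up, and move them back to "available" so queue workers execute them.
Ecto.Adapters.SQL.query!(MyApp.Repo, """
UPDATE oban_jobs
   SET state = 'available'
 WHERE state = 'scheduled'
   AND scheduled_at <= now()
""")
```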

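One way to surface errors that Oban captures internally is to attach a handler to the telemetry events it emits and forward them to Sentry. A hedged sketch, assuming Oban’s documented `[:oban, :plugin, :exception]` event and the Sentry Elixir SDK (exact event names and metadata keys depend on the Oban version in use):

```elixir
defmodule MyApp.ObanErrorReporter do
  # Forward exceptions raised inside Oban's plugins (including the one that
  # stages scheduled jobs) to Sentry instead of only logging them locally.
  def attach do
    :telemetry.attach(
      "oban-plugin-errors",
      [:oban, :plugin, :exception],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event(_event, _measurements, %{reason: reason, stacktrace: stacktrace}, _config) do
    # `reason` may not always be an exception struct; a production handler
    # would normalize it before reporting.
    Sentry.capture_exception(reason, stacktrace: stacktrace)
  end
end
```
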
Impact during incident
  • ~ 10:00 a.m. – 2:15 p.m. EDT — A subset of delayed workflow runs never resume, leaving those workflow runs stuck in a processing state when viewed in the Knock Dashboard.

  • 2:15 p.m. – 6:11 p.m. EDT — A subset of delayed workflow runs resume up to one minute after their intended resume time. All workflow runs stuck in a paused/processing state up until this point are resumed and completed.

Impact post-incident

There was no impact after the incident was resolved.

Ongoing impact

There is no ongoing impact, and the incident has been resolved.

Post-incident action items
  • We have shipped a more aggressive, automated pruning schedule for stale jobs in our database (see the configuration sketch following this list). This should prevent such a large number of stale jobs from accruing again. In the unlikely event we do accrue too much stale job data in the future, we have created an on-call runbook for manually pruning it.

  • We are setting up monitors to page Knock’s on-call engineer should we observe growth in stale jobs or an uptick in query errors when attempting to stage scheduled jobs (a sketch of the underlying check also follows).
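
As a point of reference for the first action item above, Oban ships a Pruner plugin that periodically deletes completed, cancelled, and discarded jobs older than a configurable age. A minimal configuration sketch (the retention window and application name shown are illustrative, not the values Knock chose):

```elixir
# config/config.exs
import Config

config :my_app, Oban,
  repo: MyApp.Repo,
  queues: [workflows: 50],
  plugins: [
    # Prune stale job rows after one hour so oban_jobs cannot quietly grow
    # to tens of millions of rows again.
    {Oban.Plugins.Pruner, max_age: 60 * 60}
  ]
```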

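For the second action item, the check behind such a monitor can be as simple as tracking the count of stale rows in `oban_jobs`. The query below is a sketch of that measurement; in practice it would feed whatever metrics and paging tooling is already in place rather than live in application code.

```elixir
# Count stale job rows; alert/page if this number trends sharply upward.
%{rows: [[stale_count]]} =
  Ecto.Adapters.SQL.query!(
    MyApp.Repo,
    "SELECT count(*) FROM oban_jobs WHERE state IN ('completed', 'cancelled', 'discarded')"
  )

IO.inspect(stale_count, label: "stale oban_jobs rows")
```
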
Tue, Aug 6, 2024, 07:51 PM UTC

Resolved

We've fixed the underlying issues with delayed workflows and we're seeing that all workflows are running as normal again.

Mon, Aug 5, 2024, 10:11 PM UTC

Monitoring

We are continuing to monitor for any further issues.

Mon, Aug 5, 2024, 07:29 PM UTC

Monitoring

We're monitoring a fix that we have in place. We're seeing delay times return to normal on workflow execution.

Mon, Aug 5, 2024, 07:28 PM UTC

Identified

We're still seeing higher-than-usual delays on workflow runs right now. We're working to resolve this.

Mon, Aug 5, 2024, 06:55 PM UTC

Identified

We've issued a fix and have ensured we're processing delayed jobs again. We're continuing to investigate the root cause of this issue.

Mon, Aug 5, 2024, 06:25 PM UTC

Investigating

We are continuing to investigate the issue and are looking into manual mitigation strategies in the short term. We will post an update in the next 30 minutes.

Mon, Aug 5, 2024, 05:57 PM UTC

Investigating

A subset of workflow runs with delay steps are delayed. We are actively investigating and applying mitigations. We will update the status page as we gather more information.

Mon, Aug 5, 2024, 05:21 PM UTC