Slow workflow processing and message delivery

Write-up

Summary
Starting at 4:00 AM EDT on Friday, May 1, 2026, Knock began receiving an elevated volume of workflow trigger requests from a customer that was suffering a security breach. This traffic dramatically increased the load on our services.

Knock systems were not compromised as part of this incident and no data for any other customers were exposed to malicious actors.

By 6:30 AM EDT, Knock was attempting to process roughly 1.5 million workflows per minute for this single customer, reflecting a 7400% increase over our normal baseline for the time period. As a result, Knock’s notification processing and message delivery system was partially degraded. Customers with workflow processing routed to the same queueing infrastructure as the impacted customer experienced delays in notification delivery. All customers experienced delays in message event and workflow run log data availability in the Knock Dashboard and V1 API. Finally, Knock failed to process 0.02% of workflow triggers between 4:00 AM and 6:42 AM EDT due to S3 write rate limits.
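As a rough sanity check on these figures, the baseline implied by the stated peak and percentage increase can be backed out with simple arithmetic. The sketch below assumes that a "7400% increase" means the peak equals the baseline plus 74 times the baseline; the function name is illustrative only and does not come from Knock.

```python
def implied_baseline(peak_per_minute: float, percent_increase: float) -> float:
    """Return the baseline rate implied by a peak rate and a percent increase.

    Assumes peak = baseline * (1 + percent_increase / 100).
    """
    return peak_per_minute / (1 + percent_increase / 100)


# Figures from the summary: roughly 1.5 million workflows per minute,
# described as a 7400% increase over the normal baseline.
baseline = implied_baseline(1_500_000, 7400)
print(f"Implied baseline: about {baseline:,.0f} workflows per minute")
```

Under that reading, the normal baseline for this time period works out to roughly 20,000 workflows per minute.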

This incident took longer to acknowledge on our status page than we would like due to a delay in internal escalation and an incorrect initial diagnosis of the root cause.

Timeline

May 1, 2026, 4:00 AM EDT — Knock begins receiving elevated workflow trigger requests from the impacted customer.
4:22 AM EDT — Knock starts failing to process a small number of workflow trigger requests for this customer. The root issue is rate-limited write requests to S3, where Knock offloads some of the data used to process notification workflow triggers. The primary on-call engineer is paged and comes online to investigate.
4:30 AM EDT — S3 write rate limit errors begin to impact other customers intermittently. One customer sees a long-running broadcast halt with only 10% of recipients notified due to Knock failing to recover from an S3 write rate limit.
5:00 AM EDT — A subset of Knock notification workflow processing queues becomes backlogged due to increasing workflow trigger request volume from the impacted customer. In addition, Knock experiences delays in writing message event and workflow run log data to our ClickHouse log data cluster.
5:45 AM EDT — A Knock Platform engineer wakes up and sees that the system is severely degraded. Up until this point, the Knock on-call engineer had been working in isolation, attempting to identify and resolve the issue. The on-call engineer mistakenly believed the degradations were due to a known scale-up planned by another customer. However, the Platform engineer quickly identifies this as a separate and more serious issue.
5:54 AM EDT — The Knock Platform engineer is fully online and an internal incident is opened.
6:12 AM EDT — Knock Support receives a note from the customer whose long-running broadcast failed mid-fan-out.
6:19 AM EDT — Knock engineers identify the traffic as coming from a single customer and reflecting a substantial increase over their normal traffic.
6:19 AM EDT — Knock publishes an incident to the public status page.
6:42 AM EDT — After trying to offload the impacted customer’s traffic to isolated, deprioritized queues, Knock engineers realize it is not possible to absorb these traffic patterns without serious, ongoing degradation for all other customers. Thus, Knock elects to begin blocking all new requests from the impacted customer. In addition, Knock begins “noop-ing” all their queued work, bypassing any attempt to actually process and deliver notification messages in favor of rapid load shedding. By this point in time, Knock is attempting to process 1.5 million notification workflows per minute for the impacted customer, which is a 7400% increase over the normal baseline for this time of day.
6:45 AM EDT — Knock engineers observe immediate resolution of intermittent S3 write rate limiting errors.
7:00 AM EDT — Knock recovers from delays writing message event and workflow run log data.
8:42 AM EDT — Knock finishes processing through the large backlog of notification workflows queued by the impacted customer. All workflow processing delays for customers on the same queueing infrastructure cease.
9:18 AM EDT — All clear is sent to the Knock status page following a monitoring period.
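
The mitigation described in the timeline combines two steps: rejecting new requests from a single customer, and "noop-ing" that customer's already-queued work so it is discarded rather than delivered. The following is a minimal sketch of that general pattern, not Knock's actual implementation. The `Job` and `LoadShedder` types and the tenant names are hypothetical, and an in-memory queue stands in for real queueing infrastructure.

```python
from collections import deque
from collections.abc import Callable
from dataclasses import dataclass, field


@dataclass
class Job:
    tenant_id: str  # hypothetical customer identifier
    payload: str


@dataclass
class LoadShedder:
    blocked_tenants: set = field(default_factory=set)
    queue: deque = field(default_factory=deque)

    def block(self, tenant_id: str) -> None:
        """Start rejecting new work and shedding queued work for a tenant."""
        self.blocked_tenants.add(tenant_id)

    def accept(self, job: Job) -> bool:
        """Enqueue a job unless its tenant is blocked; return whether it was accepted."""
        if job.tenant_id in self.blocked_tenants:
            return False
        self.queue.append(job)
        return True

    def drain(self, deliver: Callable[[Job], None]) -> tuple:
        """Process the queue, skipping (no-op) jobs from blocked tenants.

        Returns a (delivered, shed) pair of counts.
        """
        delivered = shed = 0
        while self.queue:
            job = self.queue.popleft()
            if job.tenant_id in self.blocked_tenants:
                shed += 1  # no-op: drop the job without doing delivery work
                continue
            deliver(job)
            delivered += 1
        return delivered, shed


shedder = LoadShedder()
shedder.accept(Job("tenant-a", "welcome-email"))
shedder.accept(Job("tenant-b", "burst-1"))
shedder.accept(Job("tenant-b", "burst-2"))

shedder.block("tenant-b")
accepted = shedder.accept(Job("tenant-b", "burst-3"))  # rejected at the door

sent = []
delivered, shed = shedder.drain(sent.append)
print(accepted, delivered, shed)  # False 1 2
```

The point of the no-op path is that shedding a job costs far less than processing it, so the backlog drains quickly while work for other tenants continues to be delivered.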

Impact during incident

Knock failed to process 294 notification workflow triggers and broadcasts due to the intermittent S3 write rate limiting errors. This amounted to roughly 0.02% of all workflow traffic during the incident timeline.
Knock experienced significant delays processing workflow triggers for customers routed to the same queueing infrastructure as the customer that produced the excess traffic.

Post-incident action items

Knock engineering is reviewing on-call escalation policies to ensure that on-call engineers quickly escalate ongoing issues if they are not able to resolve them in a timely manner.
Knock engineering will work on multiple changes to how we write data to S3 during workflow job processing, to improve resilience to intermittent rate limiting errors.
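
One common way to tolerate intermittent rate limiting on writes is to retry with exponential backoff and jitter. The sketch below is a generic illustration of that technique and is not a description of the changes Knock plans to make. `RateLimitedError`, `write_with_backoff`, and the default parameters are all hypothetical, and the write is an arbitrary callable rather than a real S3 client call.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitedError(Exception):
    """Hypothetical error raised when the storage backend throttles a write."""


def write_with_backoff(
    write: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call write(), retrying on RateLimitedError with capped, jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return write()
        except RateLimitedError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff capped at max_delay, with full jitter so
            # that many retrying workers do not all retry at the same moment.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
    raise ValueError("max_attempts must be at least 1")


# Usage: a stand-in write that is throttled twice before succeeding.
attempts = {"count": 0}


def flaky_write() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitedError("slow down")
    return "stored"


result = write_with_backoff(flaky_write, sleep=lambda _: None)
print(result, attempts["count"])  # stored 3
```

Retries alone only smooth over short bursts of throttling. Under sustained overload like the one in this incident, they would need to be paired with load shedding or isolation so that retried writes do not add to the pressure.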