Workflow triggers and integration events for your account may not have been processed for about 22 minutes, between 1:31 PM and 1:53 PM ET on May 24, 2023, even though our API returned responses indicating that those requests were processed successfully.
May 24, 2023 at 17:31 UTC: We misconfigured our API as part of an infrastructure update. From that point, all API calls to trigger workflows, and all source events sent to Knock, were neither recorded nor processed. Other API traffic was not affected.
17:47 UTC: Our engineering team was monitoring the rollout as part of our standard deployment procedure. The team noticed some metrics discrepancies during the rollout and began running manual tests against the API to confirm them.
17:51 UTC: Our engineering team concluded that the rollout might be faulty and initiated a rollback.
17:53 UTC: The rollback completed and API traffic processing resumed as before.
To explain why the outage occurred, we first need to establish some context.
At Knock, we constantly improve our platform by adding or extending features, and by investing in our infrastructure. We have been investing in better separation between our core system components so that we can horizontally scale certain components independently. One such component is our notification engine and queue processor, which had historically lived alongside our API ingestion layer.
To that end, we created a new processing service to isolate this workload from API traffic. Our intention was to run both our API service and the new processing service in parallel, then retire the API service's ability to process notifications using application-level configuration.
Before the incident, we had scaled up our processing servers and confirmed that they were properly running workflows and other background processing.
At 17:31 UTC, we enabled a configuration change that switched off all queue processing in the API layer.
Our mistake was that this flag not only turned off our Kinesis consumers in the API layer, it also disabled the in-memory queue that buffered writes out to Kinesis. With that buffer disabled, incoming API requests were handed to a buffer that was neither accepting new writes nor being drained into Kinesis, so those events were dropped.
These buffer writes were made asynchronously, without waiting to see whether each write succeeded or failed. The original design intent of the asynchronous write was to improve API throughput. Had the writes been synchronous, they would have failed, giving us the option to retry the request or raise an alert. Because they were asynchronous, the writes failed silently and data was lost, even as each request appeared to succeed and returned an HTTP 200 “OK” response. The issue was compounded by the fact that our API log writer relied on the same buffering mechanism, meaning the API request logs written during this window were also lost.
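The failure mode above can be illustrated with a minimal sketch. This is not Knock's actual implementation; all names here are hypothetical, and the point is only how a fire-and-forget write lets the API return success while the event is silently dropped:

```python
import queue


class EventBuffer:
    """Hypothetical in-memory buffer that feeds a downstream stream."""

    def __init__(self):
        self.enabled = True  # the flag that was misapplied in the incident
        self._queue = queue.Queue()

    def write_async(self, event):
        # Fire-and-forget: the caller never learns whether the write landed.
        if not self.enabled:
            return  # event silently dropped
        self._queue.put(event)

    def write_sync(self, event):
        # A synchronous write surfaces the failure to the caller instead.
        if not self.enabled:
            raise RuntimeError("buffer disabled; write rejected")
        self._queue.put(event)


def handle_api_request(buffer, event):
    buffer.write_async(event)
    return 200  # returned even when the event was dropped


buf = EventBuffer()
buf.enabled = False  # the misapplied configuration change
status = handle_api_request(buf, {"type": "workflow.trigger"})
assert status == 200          # the API reports success...
assert buf._queue.empty()     # ...but no event was recorded
```

Had `handle_api_request` used `write_sync` instead, the disabled buffer would have raised an error, turning the silent data loss into a visible request failure.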
By 17:53 UTC, we had rolled back the configuration change, ending the outage window 22 minutes after it began.
We employ multiple layers of monitoring to detect and prevent issues from reaching this stage. Although we monitor queue reads in our application, and we monitor queue write failures, we did not have monitoring in place to raise an alarm if queue writes fell sharply and unexpectedly in the absence of error messages. Although our engineering team saw metrics drop suddenly, it was not immediately clear whether the metrics were delayed or whether there was an error in writing metrics. This uncertainty delayed our ability to identify and respond to this incident more promptly.
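The missing alert condition can be sketched as a simple rule: fire when write volume falls sharply *without* a matching rise in error counts, since that combination is the signature of silent failure. The thresholds below are illustrative, not our production values:

```python
def should_alert(recent_writes, baseline_writes, recent_errors,
                 drop_threshold=0.5):
    """Alert when queue writes drop sharply while error counts stay at zero.

    A drop accompanied by errors is already caught by write-failure
    monitoring; this rule targets the silent-failure gap.
    """
    if baseline_writes == 0:
        return False  # no baseline to compare against
    drop = 1 - (recent_writes / baseline_writes)
    return drop >= drop_threshold and recent_errors == 0
```

For example, `should_alert(10, 1000, 0)` fires (a 99% drop with no errors), while `should_alert(10, 1000, 50)` does not, because the errors would already have triggered the existing write-failure alarms.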
As part of our standard rollout process, these changes were also deployed to our internal development environment first to allow for internal testing. Internal testing in this case was not thorough enough to detect this edge case, as the platform reported success sending messages under manual testing. We can improve our test procedure to require that workflow processing completes, not just that the API returns an HTTP 200 response.
Because this incident was triggered by a manual deployment, our engineering team was monitoring the rollout. Initially, the rollout was handled by a single team member, who then raised concerns during the incident window to get more eyes on the situation. We will consider requiring that at least two team members manage each rollout in the future for more rapid issue triage during a potential incident.
Even so, we were able to detect, triage, respond to, and resolve the incident within 22 minutes. Monitoring and alerting improvements are already in place to cut this time further if we see this type of issue again.
After correcting the configuration issue, we used our API logs to identify which customers were affected by the outage. Although we did not process a subset of API traffic during the incident, we still recorded which API endpoints were triggered, which gave us enough detail to build a list of affected customers. Customers with signed SLAs have been notified of the incident and given a list of workflow triggers affected during the incident window.
Because detection and mitigation happened within a few minutes of each other, we did not update our status page until after the incident had concluded.
We’re going to make further investments to ensure that requests to our API are more resilient against failures in writing to our request buffers. We take any dropped data very seriously at Knock, and our goal is to close the gaps we currently have so we can avoid issues like this in the future.
We have added new monitoring to our infrastructure so that we can detect and respond to write failures more rapidly.
In our staging environment, our automated tests should trigger a workflow and watch for a sent notification and associated logs. We have ticketed this along with developing a framework for automated “smoke tests” to run in our staging environment.
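A smoke test of this shape can be sketched as follows. The `client` object and its method names are hypothetical stand-ins for an API wrapper, not Knock's actual test framework; the point is that the test only passes once a sent notification is observed, not when the trigger call returns HTTP 200:

```python
import time


def smoke_test_workflow(client, timeout_s=60, poll_interval_s=2):
    """Trigger a workflow, then poll message logs until a sent
    notification appears. Asserting only on the HTTP response would
    have passed during this incident; asserting on delivery would not.
    """
    resp = client.trigger_workflow("smoke-test",
                                   recipient="smoke@example.com")
    assert resp.status_code == 200  # necessary, but not sufficient

    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if client.find_sent_message(workflow="smoke-test",
                                    recipient="smoke@example.com"):
            return True  # notification was actually processed and sent
        time.sleep(poll_interval_s)
    raise AssertionError("HTTP 200 returned but no notification was sent")
```

Run on a schedule in staging, a check like this would have surfaced the incident's failure mode (accepted requests that never produce notifications) before the change reached production.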
When making significant infrastructure changes, we will require that at least two team members collaborate on the release process, with detailed plans and checklists developed in advance, approved, and followed during rollout.
Customers depend upon Knock to reliably accept and process API traffic 24/7. Data loss incidents like this are unacceptable, especially when our API was responding as though everything was normal.
Customers like you partner with Knock because you trust us to reliably and consistently deliver great experiences for your products and customers. We deeply value that trust and do not take it for granted. Please reach out if you have any concerns or further questions. You can reach out to me personally at firstname.lastname@example.org.