API ingestion issues caused by misconfiguration
Incident Report for Knock
Postmortem

Knock may not have processed workflow triggers or integration events for your account for about 22 minutes, between 1:31 PM ET and 1:53 PM ET on May 24, 2023, even though our API returned responses indicating that those requests were processed successfully.

What happened?

  • May 24, 2023 at 17:31 UTC: We misconfigured our API as part of an infrastructure update. Immediately, all API calls to trigger workflows and all source events sent to Knock were neither recorded nor processed. Other API traffic was not affected.

    • Our API returned an HTTP 200 “OK” response, making the failure invisible to customer systems and preventing retries.
  • 17:47 UTC: Our engineering team was monitoring the rollout as part of our standard deployment procedure. The team noticed some metric discrepancies during the rollout and started running manual tests against the API to confirm them.

    • At this time, it was not clear whether there was an outage or whether metrics reporting was temporarily lagging.
  • 17:51 UTC: Our engineering team concluded that the rollout might be faulty and initiated a rollback.

  • 17:53 UTC: The rollback completed and API traffic processing resumed as before.

Why did it happen?

To explain why the outage occurred, we first need to establish some context.

Architecture changes

At Knock, we constantly improve our platform by adding or extending features and by investing in our infrastructure. We have been investing in better separation between our core system components so that we can horizontally scale each of them independently. One such component is our notification engine and queue processor, which had historically lived alongside our API ingestion layer.

As such, we created a new processing service to isolate this workload from API traffic. Our intention was to run our API service and the new processing service in parallel, then use application-level configuration to retire the API service's ability to process notifications.

The root cause

Before the incident, we had scaled up our processing servers and confirmed that they were properly running workflows & other background processing.

At 17:31 UTC, we enabled a configuration change that switched off queue processing in the API layer.

Our mistake was that this flag not only turned off our Kinesis consumers in the API layer, but also disabled the in-memory queue that buffered writes out to Kinesis. With the in-memory buffer disabled, incoming API requests were handed to a buffer that was neither accepting writes nor being drained into Kinesis.
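To make the failure mode concrete, here is a minimal sketch of how a single flag can end up gating both queue consumption and the write buffer. This is a hypothetical illustration in TypeScript; the names (QUEUE_PROCESSING_ENABLED, InMemoryBuffer, startKinesisConsumers) are ours, not Knock's actual implementation.

```typescript
// Hypothetical sketch of the coupling described above. The flag name,
// InMemoryBuffer, and startKinesisConsumers are illustrative only.
const QUEUE_PROCESSING_ENABLED = process.env.QUEUE_PROCESSING_ENABLED === "true";

class InMemoryBuffer<T> {
  private items: T[] = [];

  // The bug: the same flag that controls queue consumption also controls
  // whether the buffer accepts writes at all.
  push(item: T): void {
    if (!QUEUE_PROCESSING_ENABLED) {
      return; // write is silently dropped
    }
    this.items.push(item);
  }

  drain(): T[] {
    const batch = this.items;
    this.items = [];
    return batch;
  }
}

const requestBuffer = new InMemoryBuffer<Record<string, unknown>>();

// Intended effect of the flag: stop consuming the queue in the API layer.
if (QUEUE_PROCESSING_ENABLED) {
  startKinesisConsumers();
}

// Unintended effect: with the flag off, this becomes a no-op, so the payload
// never reaches Kinesis even though the API request "succeeds".
function recordWorkflowTrigger(payload: Record<string, unknown>): void {
  requestBuffer.push(payload);
}

// Stub for the worker that drains the buffer and processes records.
function startKinesisConsumers(): void {
  /* poll requestBuffer.drain() and process records */
}
```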

These buffer writes were made asynchronously, without waiting to see whether the write had succeeded or failed. The original design intent of the async write was to improve API throughput. If the writes had been made synchronously, they would have failed visibly, giving us the option to retry the API request or raise an alert. Because they were made asynchronously, the writes failed silently and data was lost, even though the request appeared to succeed and an HTTP 200 “OK” response was returned. This issue was compounded by the fact that our API log writer relied on the same buffering mechanism, meaning that API request logs written during this window were also lost.
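The difference between a fire-and-forget write and an awaited one can be sketched with a hypothetical request handler. This is illustrative only; enqueueToKinesis stands in for the buffered write described above.

```typescript
// Simulates the outage: the buffered write path rejects every write.
async function enqueueToKinesis(_payload: unknown): Promise<void> {
  throw new Error("buffer is not accepting writes");
}

// Fire-and-forget: the rejection is swallowed, and the caller still gets a 200.
function handleTriggerAsync(payload: unknown): { status: number } {
  void enqueueToKinesis(payload).catch(() => {
    /* error is dropped: no retry, no alert */
  });
  return { status: 200 }; // "OK", even though the data was lost
}

// Awaited variant: the failure surfaces, so the API can return an error
// and the client can retry.
async function handleTriggerSync(payload: unknown): Promise<{ status: number }> {
  try {
    await enqueueToKinesis(payload);
    return { status: 200 };
  } catch {
    return { status: 503 }; // caller sees the failure and can retry
  }
}
```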

By 17:53 UTC, we had rolled back the configuration change, ending the outage window 22 minutes after it started.

A gap in queue monitoring & testing

We employ multiple layers of monitoring to detect and prevent issues from reaching this stage. Although we monitor queue reads in our application and we monitor queue write failures, we did not have monitoring in place to sound an alarm if queue writes fell sharply and unexpectedly in the absence of error messages. Although our engineering team saw metrics drop suddenly, it was not immediately clear whether the metrics were delayed or whether there was an error in writing metrics. This uncertainty delayed our identification of, and response to, the incident.
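As an illustration of the missing signal, a check along the following lines would flag the situation we hit: queue writes collapsing while the error count stays at zero. The metric names and thresholds here are hypothetical, not Knock's actual monitoring configuration.

```typescript
// Hypothetical "writes dropped sharply with no errors" check. In practice this
// would live in a monitoring system; names and thresholds are illustrative.
interface QueueWriteStats {
  currentWritesPerMinute: number;
  baselineWritesPerMinute: number; // e.g. a trailing one-hour average
  currentWriteErrorsPerMinute: number;
}

function shouldAlertOnSilentWriteDrop(
  stats: QueueWriteStats,
  dropThreshold = 0.5 // alert if writes fall below 50% of the baseline
): boolean {
  // Skip quiet periods where a baseline comparison is meaningless.
  if (stats.baselineWritesPerMinute < 10) {
    return false;
  }

  const droppedSharply =
    stats.currentWritesPerMinute < stats.baselineWritesPerMinute * dropThreshold;

  // The gap described above: a sharp drop *without* any error signal is
  // exactly the case that previously did not trigger an alarm.
  return droppedSharply && stats.currentWriteErrorsPerMinute === 0;
}
```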

As part of our standard rollout process, these changes were also deployed to our internal development environment first to allow for internal testing. Internal testing in this case was not thorough enough to catch this edge case, as the platform reported success sending messages under manual testing. We can improve our test procedure to require that workflow processing completes, rather than just that an HTTP 200 response is returned from the API.

Internal & External Response

Because this incident was triggered by a manual deployment, our engineering team was monitoring the rollout. Initially, the rollout was handled by a single team member, who then raised concerns during the incident window to get more eyes on the situation. In the future, we will consider requiring that at least two team members manage each rollout, for more rapid triage during a potential incident.

Even so, we were able to detect, triage, respond to, and resolve the incident within 22 minutes. Monitoring & alerting improvements are already in place to cut this time further if we see this type of issue again.

After correcting the configuration issue, we used our API logs to identify which customers were affected by the outage. Although we did not process a subset of API traffic during the incident, we still recorded the API endpoints being triggered, which gave us enough detail to build a list of affected customers. Customers with signed SLAs have been notified of the incident and given a list of the workflow triggers affected during it.

Because detecting and mitigating the issue happened within a few minutes of each other, we did not update our status page until after the incident had concluded.

How will we avoid this in the future?

Improve our API request resiliency

We’re making further investments to ensure that requests to our API are resilient against failures in writing to our request buffers. We take any dropped data very seriously at Knock, and our goal is to close the gaps that allowed this issue to occur.
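As a purely illustrative sketch of what this kind of change could look like (not a description of Knock's actual fix), the write path could wait for the buffer to accept each request, retry a bounded number of times, and surface an explicit error rather than dropping data silently:

```typescript
// Illustrative only: one possible shape of a more resilient write path.
// acceptIntoBuffer is a hypothetical stand-in for the request buffer write.
async function acceptIntoBuffer(_payload: unknown): Promise<void> {
  // ...hand the payload to the in-memory buffer / Kinesis producer...
}

async function resilientEnqueue(
  payload: unknown,
  maxAttempts = 3
): Promise<{ status: number }> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await acceptIntoBuffer(payload); // wait for the write to be accepted
      return { status: 202 }; // acknowledge only once the data is buffered
    } catch {
      if (attempt === maxAttempts) {
        // Surface the failure instead of silently dropping the request, so
        // the client can retry and on-call can be alerted.
        return { status: 503 };
      }
      await new Promise((resolve) => setTimeout(resolve, 100 * attempt)); // back off
    }
  }
  return { status: 503 }; // unreachable, but keeps the return type total
}
```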

Enhanced monitoring for queue write cases

We have added new monitoring to our infrastructure so that we can detect and respond more rapidly to queue write failures and to unexpected drops in write volume.

More rigorous automated testing

In our staging environment, our automated tests should trigger a workflow and watch for a sent notification and its associated logs. We have ticketed this work, along with developing a framework for automated “smoke tests” to run in staging.
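A sketch of what such a smoke test could look like; the helper functions are hypothetical stand-ins for calls against our staging API, not the actual test harness:

```typescript
// Hypothetical end-to-end smoke test: an HTTP 200 from the trigger call is not
// enough on its own; the test also waits for evidence that the workflow ran.

async function triggerTestWorkflow(recipientId: string): Promise<{ status: number }> {
  // Stand-in for calling the staging workflow trigger endpoint for a test workflow.
  return { status: 200 };
}

async function countDeliveredMessages(recipientId: string): Promise<number> {
  // Stand-in for querying staging for messages delivered to this recipient.
  return 0;
}

async function smokeTestWorkflowDelivery(): Promise<void> {
  const recipientId = `smoke-test-${Date.now()}`;
  const before = await countDeliveredMessages(recipientId);

  const response = await triggerTestWorkflow(recipientId);
  if (response.status !== 200) {
    throw new Error("trigger request failed");
  }

  // Poll for up to 60 seconds for a delivered message and its logs.
  const deadline = Date.now() + 60_000;
  while (Date.now() < deadline) {
    if ((await countDeliveredMessages(recipientId)) > before) {
      return; // a notification was actually sent: the pipeline is healthy
    }
    await new Promise((resolve) => setTimeout(resolve, 2_000));
  }
  throw new Error("workflow was triggered but no delivered message was observed");
}
```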

Checklists & “Pair programming” for specialty ops work

When making significant infrastructure changes, we will require that at least two team members collaborate on the release process, with detailed plans and checklists developed in advance, approved, and followed during the rollout.

In closing

Customers depend upon Knock to reliably accept and process API traffic 24/7. Data loss incidents like this are unacceptable, especially when our API was responding as though everything was normal.

Customers like you partner with Knock because you trust us to reliably and consistently deliver great experiences for your products and customers. We deeply value that trust and do not take it for granted. Please reach out if you have any concerns or further questions; you can contact me personally at chris@knock.app.

Posted May 25, 2023 - 22:25 UTC

Resolved
We have already rolled back this configuration change and all data is being processed again as normal as of 1:53 PM ET. We'll be sharing a post-mortem shortly.
Posted May 24, 2023 - 17:53 UTC
Investigating
At 1:31 PM ET (5:31 PM UTC) we started to roll out an infrastructure upgrade to improve our ability to scale the Knock API. This upgrade caused an unexpected failure in API processing for 22 minutes, until 1:53 PM ET, due to a misconfiguration of a critical component of our API ingestion layer.

During this time frame, workflow triggers and integration events were not processed and were dropped, despite successful API responses. Scheduled workflows, changes made to Knock environments using the CLI or dashboard, and already stored user data were not affected. Any affected workflow triggers or source events sent during this window can be resent to our API to ensure they are processed.
Posted May 24, 2023 - 17:31 UTC
This incident affected: Knock systems (Notification delivery, API, Link and open tracking).