Intermittent 502 errors returned via the Knock V1 API
Incident Report for Knock
Postmortem

Knock V1 API 502s Incident Postmortem

Summary

From August 28, 2024 at 5:44 p.m. EDT until September 25, 2024 at 3:48 p.m. EDT, the Knock V1 API returned a small but steady number of unexpected HTTP 502 errors. These errors accounted for approximately 0.02% of all Knock V1 API traffic. The root issue was a bug in the HTTP server library that helps power the V1 API.

Timeline

  • August 28, 2024, 5:44 p.m. EDT — Knock engineers ship a release of the V1 API service that includes an update to cowboy, the Erlang HTTP server that powers the Knock V1 API (we write our backend services at Knock in Elixir and use the Phoenix framework as an abstraction above cowboy). This release updates cowboy from v2.10.0 to v2.12.0. The new version includes a bug (see ninenines/cowboy#1654) that causes the V1 API service to start returning a small but steady number of 502 Bad Gateway responses: an average of seven 502s per minute, or ~0.02% of all V1 API traffic.
  • September 25, 2024, 12:26 p.m. EDT — Knock Support receives a customer report of an increase in 500-level error responses from the V1 API starting in late August. Until this point, Knock engineering has been unaware of the cowboy bug introduced almost a month earlier. There are two reasons for this miss:

    • 1) The alert threshold on Knock’s monitor for the V1 API 500-level error rate is higher than the error rate actually being returned to customers.
    • 2) The cowboy library update shipped on 2024-08-28 was not a standalone change: it was pulled in as part of an update to a separate dependency, Sentry (the error capture tool).
  • 1:24 p.m. EDT — Knock acknowledges the issue and the engineering team declares an incident.

  • 1:52 p.m. EDT — Knock engineers identify the V1 API release on 2024-08-28 that correlates with the increase in 502 errors.

  • 2:07 p.m. EDT — Knock engineers identify the cowboy update included in the Sentry update as the probable culprit for the issue. In addition, engineers identify a ticket filed three months prior (2024-06-17) warning of a possible bug within cowboy v2.12.0 causing 502 errors.

  • 3:40 p.m. EDT — Knock ships a release of the V1 API service that reverts the cowboy library update, pinning it to the last known safe version for our system (v2.10.0).

  • 3:48 p.m. EDT — Knock engineers verify 502s returned from the V1 API have ceased. All clear is given.
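
To illustrate why the first reason above allowed the errors to go unnoticed, the sketch below works through the figures from the timeline and compares a fixed-threshold check with a check relative to a recent baseline. It is a minimal illustration, not Knock's actual monitor configuration: the 1% fixed threshold, the 0.002% baseline error rate, the 3x ratio, and both function names are hypothetical values chosen only for the example.

```python
# Figures taken from the timeline above.
ERRORS_PER_MIN = 7       # average 502s per minute during the incident
ERROR_SHARE = 0.0002     # ~0.02% of all V1 API traffic

# Implied request volume: 7 / 0.0002 is roughly 35,000 requests per minute.
implied_requests_per_min = round(ERRORS_PER_MIN / ERROR_SHARE)


def fixed_threshold_alert(error_rate: float, threshold: float) -> bool:
    """Alert only when the error rate exceeds an absolute threshold."""
    return error_rate > threshold


def baseline_relative_alert(
    error_rate: float,
    baseline_rate: float,
    min_ratio: float = 3.0,   # hypothetical: alert at 3x the baseline
    floor: float = 1e-5,      # hypothetical: avoids alerting on noise near zero
) -> bool:
    """Alert when the error rate is a multiple of its recent baseline."""
    return error_rate >= max(baseline_rate, floor) * min_ratio


# Hypothetical monitor settings, for illustration only.
HYPOTHETICAL_THRESHOLD = 0.01      # 1% absolute threshold
HYPOTHETICAL_BASELINE = 0.00002    # 0.002% error rate before the release

missed = not fixed_threshold_alert(ERROR_SHARE, HYPOTHETICAL_THRESHOLD)
caught = baseline_relative_alert(ERROR_SHARE, HYPOTHETICAL_BASELINE)
```

Under these assumed values, the fixed-threshold check stays silent while the baseline-relative check fires, because the latter reacts to a change in the error rate rather than to its absolute size.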

Impact during incident

  • August 28, 2024, 5:44 p.m. EDT – September 25, 2024, 3:48 p.m. EDT — The Knock V1 API returns unexpected 502 errors for approximately 0.02% of requests.

Impact post-incident

There was no impact after the incident was resolved.

Ongoing impact

There is no ongoing impact, and the incident has been resolved.

Post-incident action items

  • Knock engineers will audit all V1 API error response monitors to ensure we can proactively detect novel increases in error rates, even if those error rates amount to a small share of total traffic.
  • Knock engineers will prioritize work to migrate the V1 API HTTP server from cowboy to Bandit, a newer library that has recently become the recommended option for Phoenix-based web backends. 
  • Knock engineers will review processes for ensuring that dependency updates known to be unsafe do not ship to production environments.
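
One way the last action item could be approached is a CI check that scans the resolved lockfile for versions already known to be unsafe. Because a lockfile records every resolved package, such a check would also cover updates pulled in through another dependency. The sketch below is an assumption-laden illustration, not Knock's actual process: the denylist, the `find_unsafe` helper, and the regular expression are hypothetical, and the pattern assumes the usual `mix.lock` entry layout of `"name": {:hex, :name, "version", ...}`.

```python
import re

# Hypothetical denylist: package name -> versions known to be unsafe.
# cowboy 2.12.0 is the version this postmortem identifies as affected.
KNOWN_UNSAFE = {"cowboy": {"2.12.0"}}

# Assumed mix.lock entry shape: "cowboy": {:hex, :cowboy, "2.12.0", ...
LOCK_ENTRY = re.compile(
    r'"(?P<name>\w+)":\s*\{:hex,\s*:\w+,\s*"(?P<version>[^"]+)"'
)


def find_unsafe(lock_text: str, denylist=None) -> list:
    """Return (name, version) pairs in the lockfile that are on the denylist."""
    denylist = KNOWN_UNSAFE if denylist is None else denylist
    hits = []
    for match in LOCK_ENTRY.finditer(lock_text):
        name, version = match["name"], match["version"]
        if version in denylist.get(name, set()):
            hits.append((name, version))
    return hits
```

A CI job could read `mix.lock`, pass its contents to `find_unsafe`, and fail the build when the returned list is non-empty.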
Posted Sep 26, 2024 - 15:04 UTC

Resolved
We investigated customer reports of an increase in HTTP 502 errors returned by various endpoints in the V1 API. We identified the root issue as a bug in the HTTP server library we use to power the V1 API. Reverting a recent update to the server library resolved the issue.
Posted Sep 25, 2024 - 17:34 UTC