Dashboard data not loading
Incident Report for Knock
Postmortem

Knock Dashboard Degradation Incident Postmortem

Summary

Knock experienced degraded dashboard performance starting June 3, 2024 at 10:36 p.m. EDT and lasting until June 4, 2024 at 3:39 a.m. EDT.

During this time, the public-facing API did not experience any downtime or degradation; no workflow runs were affected, and all API requests were processed normally.

However, the Knock dashboard was unable to load resources such as logs, workflow runs, and user details during this time. In addition, any changes made to dashboard resources during the incident were not propagated to the public-facing API. The initial issue was resolved at 3:39 a.m. EDT, and all customer-facing issues related to re-syncing dashboard data to the public API were mitigated at 10:49 a.m. EDT.

Timeline

  • June 3, 2024

    • 8:37 p.m. EDT: The AWS Ingress Controller component of our Kubernetes control plane experienced a transient issue causing it to restart. Engineers were paged and investigated. No changes in our runtime service were noted at that time, and the monitor resolved itself.
    • 10:36 p.m. EDT: The AWS Ingress Controller performed a resynchronization and surfaced a latent issue in our internal service traffic routing configuration. Because this routing issue affected only internal traffic, it prevented our Dashboard from communicating with our API data plane, blocking the loading of logs and the synchronization of workflow changes. Primary public API traffic was not affected; however, changes to workflow configurations were not synchronized until they were manually retried later (see below). Until this incident, Knock’s monitoring infrastructure had focused on external API access rather than internal service availability, leaving a blind spot in our monitoring and paging process for this type of issue.
    • 11:27 p.m. EDT: Knock receives the first of four customer mentions of the issue but does not immediately respond due to the lack of internal alerting.
  • June 4, 2024

    • 2:18 a.m. EDT: Knock acknowledges the incident and begins investigating.
    • 2:25 a.m. EDT: Knock’s on-call platform engineer is paged and comes online.
    • 2:38 a.m. EDT: Knock Status page is updated to indicate a known incident.
    • 3:00 a.m. EDT: Knock determines that the public-facing API remains fully operational.
    • 3:22 a.m. EDT: Knock determines the source of the issue: the Kubernetes service for the Knock API was not available for requests made from inside Knock’s platform (e.g., Dashboard requests to the API service).
    • 3:29 a.m. EDT: Knock identifies a temporary mitigation: directing Dashboard traffic to our public API endpoint instead of over the misconfigured internal network (see the sketch after this timeline). Engineers work to apply this mitigation.
    • 3:39 a.m. EDT: This temporary mitigation is rolled out, and the primary outage is resolved for loading logs and other workflow information. Workflow changes made since the beginning of the incident (June 3 at 10:36 p.m. EDT) remain unsynchronized.
    • 3:52 a.m. EDT: Status page is updated to reflect restored dashboard functionality.
    • 4:01 a.m. EDT: Engineers apply a permanent fix for the underlying issue, correcting the internal service routing misconfiguration. Dashboard performance is fully restored; however, dashboard actions taken during the incident time frame remain unsynchronized.
    • 4:01 a.m. EDT until 9:49 a.m. EDT: Engineers continue to monitor service performance to ensure the Dashboard is fully restored.
    • 9:49 a.m. EDT: Engineering adds an internal monitor to detect and respond to internal service availability issues going forward. Platform configuration notes are added to internal runbooks for future reference.
    • 10:32 a.m. EDT: Knock identifies residual customer impact due to dashboard actions which failed to sync to the public API during the incident.
    • 10:38 a.m. EDT: Knock identifies all synchronization jobs that were delayed due to the incident.
    • 10:49 a.m. EDT: Knock retries all delayed jobs, and determines that customer impact has been fully mitigated.
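
To make the temporary mitigation above concrete: the 3:29 a.m. fix amounted to repointing the Dashboard’s API client from the broken internal route to the public API endpoint. Below is a minimal TypeScript sketch of that kind of fallback; the internal hostname, health path, and helper are hypothetical illustrations, not Knock’s actual configuration.

```typescript
// Sketch of the temporary mitigation: prefer the internal service route,
// but fall back to the public API endpoint when internal routing fails.
// The internal hostname and health path below are hypothetical.
const INTERNAL_API_URL = "http://api.internal.example"; // hypothetical in-cluster route
const PUBLIC_API_URL = "https://api.knock.app";         // public endpoint

async function resolveApiBaseUrl(): Promise<string> {
  try {
    // Probe the internal route with a short timeout; any failure means
    // internal routing is unavailable and the public endpoint should be used.
    const res = await fetch(`${INTERNAL_API_URL}/health`, {
      signal: AbortSignal.timeout(2_000),
    });
    if (res.ok) return INTERNAL_API_URL;
  } catch {
    // Fall through to the public endpoint.
  }
  return PUBLIC_API_URL;
}

// All subsequent Dashboard requests are built against the resolved base URL.
resolveApiBaseUrl().then((base) => console.log(`Dashboard API base: ${base}`));
```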

Impact during incident

June 3, 2024 10:36 p.m. EDT - June 4, 2024 3:40 a.m. EDT: During this incident, an issue with our internal Kubernetes networking layer caused internal traffic between Knock’s dashboard and our API service to fail. This had the following impacts:

  • Requests from the dashboard to our internal systems to load resources such as Users, Messages, and Workflow runs failed
  • Updates made in the dashboard to workflows and other resources were not propagated to the public-facing API

While Knock was operating in a degraded state, there were no outages to our public-facing API or notification delivery.

Impact post-incident

June 3, 2024 10:36 p.m. EDT - June 4, 2024 10:49 a.m. EDT: Any dashboard actions that failed during the incident left our public-facing API (its workflow versions, etc.) out of sync with the dashboards of customers who took those actions. These actions were resynced, fully mitigating the incident at 10:49 a.m. EDT.
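
Conceptually, that resync was a sweep over the dashboard-to-API sync jobs enqueued during the incident window that never completed, re-running each one. The TypeScript sketch below illustrates the idea with an in-memory stand-in for the job store and queue; every name here is hypothetical, not Knock’s internal tooling.

```typescript
// Hypothetical retry sweep: re-enqueue dashboard-to-API sync jobs that
// failed or stalled during the incident window. The in-memory jobStore and
// enqueue() below stand in for a real job store and queue.
interface SyncJob {
  id: string;
  status: "pending" | "failed" | "completed";
  enqueuedAt: Date;
}

const jobStore: SyncJob[] = [
  { id: "wf-sync-1", status: "failed", enqueuedAt: new Date("2024-06-04T03:10:00Z") },
  { id: "wf-sync-2", status: "completed", enqueuedAt: new Date("2024-06-04T03:12:00Z") },
];

async function enqueue(jobId: string): Promise<void> {
  console.log(`re-enqueued ${jobId}`);
}

const INCIDENT_START = new Date("2024-06-04T02:36:00Z"); // June 3, 10:36 p.m. EDT
const INCIDENT_END = new Date("2024-06-04T07:40:00Z");   // June 4, 3:40 a.m. EDT

async function retryDelayedSyncJobs(): Promise<void> {
  const delayed = jobStore.filter(
    (job) =>
      job.status !== "completed" &&
      job.enqueuedAt.getTime() >= INCIDENT_START.getTime() &&
      job.enqueuedAt.getTime() <= INCIDENT_END.getTime(),
  );
  for (const job of delayed) {
    await enqueue(job.id); // sync jobs are assumed idempotent, so retries are safe
  }
  console.log(`Re-enqueued ${delayed.length} delayed sync jobs`);
}

retryDelayedSyncJobs();
```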

Ongoing impact

There is no ongoing impact, and the incident has been resolved.

Post-incident action items

  • We have updated our monitoring system to detect this particular edge case moving forward (a sketch of such an internal availability check follows this list).
  • We identified the root cause of the internal service communication outage, and have resolved that issue permanently.
  • We have updated our internal runbooks to document additional edge cases in our internal Kubernetes networking layer.
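
As an illustration of the first action item: the important property of the new monitor is that it probes the same in-cluster route the Dashboard uses, rather than the public edge, so a broken internal routing configuration pages on-call even while the public API looks healthy. A minimal TypeScript sketch follows, with a hypothetical internal hostname and a log statement standing in for a real alerting hook.

```typescript
// Sketch of a synthetic internal-availability check: probe the API service
// over the in-cluster route the Dashboard uses, so internal routing failures
// are detected even while the public API stays healthy.
// The hostname and alerting hook are hypothetical.
const INTERNAL_HEALTH_URL = "http://api.internal.example/health"; // hypothetical

async function checkInternalAvailability(): Promise<void> {
  try {
    const res = await fetch(INTERNAL_HEALTH_URL, {
      signal: AbortSignal.timeout(3_000),
    });
    if (!res.ok) throw new Error(`unhealthy: HTTP ${res.status}`);
    console.log("internal API route healthy");
  } catch (err) {
    // In production this would emit a metric and page on-call, not just log.
    console.error(`internal API route check failed: ${err}`);
  }
}

// Run on an interval so a failure is detected within about a minute.
setInterval(checkInternalAvailability, 60_000);
checkInternalAvailability();
```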
Posted Jun 04, 2024 - 20:38 UTC

Resolved
This incident has been resolved.
Posted Jun 04, 2024 - 09:46 UTC
Update
We've identified the source of the issue and applied a hotfix. All data loading should be restored, and the dashboard is back to fully operational.
Posted Jun 04, 2024 - 07:52 UTC
Update
We are continuing to investigate this issue.
Posted Jun 04, 2024 - 06:39 UTC
Investigating
We're currently investigating an issue where certain data in the dashboard is not loading, including messages.
Posted Jun 04, 2024 - 06:38 UTC
This incident affected: Knock systems (Dashboard).